LeViT-128 Vision Transformer
| Property | Value |
|---|---|
| Parameter Count | 9.21M |
| Model Type | Vision Transformer (ConvNet-style) |
| License | Apache-2.0 |
| Image Size | 224x224 |
| Top-1 Accuracy | 78.474% |
| GMACs | 0.4 |
What is levit_128.fb_dist_in1k?
LeViT-128 is a vision transformer architecture from Facebook Research that combines the strengths of transformers with those of convolutional neural networks. It is optimized for fast inference while maintaining competitive accuracy, striking a balance between model size and performance: 9.21M parameters with 78.474% top-1 accuracy on ImageNet-1k.
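As a minimal usage sketch with the timm library (under which this checkpoint is published): the image path is hypothetical, and the preprocessing helpers follow timm's standard pattern for its pretrained models.

```python
import torch
import timm
from PIL import Image

# Load the pretrained classifier (1000 ImageNet-1k classes).
model = timm.create_model('levit_128.fb_dist_in1k', pretrained=True)
model.eval()

# Resolve the preprocessing (resize, crop, normalization) the weights expect.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

img = Image.open('example.jpg')  # hypothetical local image
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape (1, 1000)

top5 = logits.softmax(dim=-1).topk(5)
print(top5.indices, top5.values)
```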
Implementation Details
The model implements a hybrid architecture that performs its computation with convolutional operations (nn.Conv2d and nn.BatchNorm2d) while retaining transformer-style attention mechanisms. It was trained with knowledge distillation on the ImageNet-1k dataset and requires only 0.4 GMACs per forward pass. Key characteristics (see the inspection sketch after this list):
- Compact activation footprint of 2.7M
- Efficient inference architecture
- Distillation-based training approach
- Convolutional-style implementation for better hardware utilization
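To see how the blocks are realized in code, a small inspection sketch that tallies module types; it instantiates the architecture only, so no weight download is needed, and the printed counts are illustrative rather than guaranteed:

```python
import timm

# Build the architecture without pretrained weights and count module types.
model = timm.create_model('levit_128.fb_dist_in1k', pretrained=False)

counts = {}
for module in model.modules():
    name = type(module).__name__
    counts[name] = counts.get(name, 0) + 1

# Which layer families dominate reveals how the attention blocks
# are actually implemented under the hood.
for name in ('Conv2d', 'BatchNorm2d', 'Linear', 'LayerNorm'):
    print(name, counts.get(name, 0))
```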
Core Capabilities
- Image classification with 1000 classes
- Feature extraction backbone functionality (see the sketch after this list)
- Efficient inference on standard hardware
- Balanced performance-to-size ratio
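For the backbone use case, timm's general `num_classes=0` convention removes the classifier head so the forward pass returns pooled features. A minimal sketch, assuming that convention applies to this architecture:

```python
import torch
import timm

# num_classes=0 is timm's convention for dropping the classifier head;
# the forward pass then returns the pooled embedding instead of logits.
backbone = timm.create_model('levit_128.fb_dist_in1k', pretrained=True, num_classes=0)
backbone.eval()

with torch.no_grad():
    features = backbone(torch.randn(1, 3, 224, 224))

print(features.shape)  # (1, embed_dim) pooled feature vector
```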
Frequently Asked Questions
Q: What makes this model unique?
LeViT-128 stands out for its hybrid approach: transformer-style attention built from convolutional operations, optimized specifically for inference speed with little loss in accuracy. The result is an excellent balance between model size and performance.
Q: What are the recommended use cases?
This model is particularly well-suited for production environments where inference speed is crucial but accuracy cannot be significantly compromised. It's ideal for real-time image classification tasks, feature extraction, and as a backbone for more complex computer vision applications.
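To sanity-check the inference-speed claim on your own hardware, a rough timing sketch; the batch size and iteration counts are arbitrary choices, and absolute numbers will vary widely by machine:

```python
import time
import torch
import timm

model = timm.create_model('levit_128.fb_dist_in1k', pretrained=False)
model.eval()
batch = torch.randn(8, 3, 224, 224)

with torch.no_grad():
    for _ in range(3):               # warm-up passes
        model(batch)
    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        model(batch)
    elapsed = time.perf_counter() - start

print(f'{elapsed / runs * 1000:.1f} ms per batch of 8 (CPU by default)')
```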