# LeViT-256 Vision Transformer
| Property | Value |
|---|---|
| Parameter Count | 18.9M |
| GMACs | 1.1 |
| Image Size | 224 x 224 |
| Top-1 Accuracy | 81.512% |
| Paper | LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference |
## What is levit_256.fb_dist_in1k?
LeViT-256 is a hybrid vision transformer that borrows design choices from convolutional neural networks to speed up inference. Developed by Facebook Research, this checkpoint was trained on ImageNet-1k with knowledge distillation (hence the `fb_dist_in1k` tag), targeting a strong balance between accuracy and efficiency.
## Implementation Details
The model implements a hybrid architecture: a convolutional stem built from `nn.Conv2d` and `nn.BatchNorm2d` layers replaces the usual transformer patch embedding, while transformer-style attention blocks handle global feature mixing. With 18.9M parameters and 1.1 GMACs, it delivers efficient inference on 224x224 images; a loading sketch follows the list below.
- Optimized architecture combining CNN and transformer elements
- Knowledge distillation training approach
- 4.2M activations for efficient processing
- Balanced parameter count for mobile-friendly deployment
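A minimal loading sketch using the `timm` library (a reasonable assumption for a timm-style checkpoint name; requires `timm` and PyTorch):

```python
import timm
import torch

# Load the pretrained, distilled ImageNet-1k checkpoint via timm.
model = timm.create_model('levit_256.fb_dist_in1k', pretrained=True)
model.eval()

# Sanity-check the reported statistics: ~18.9M parameters.
n_params = sum(p.numel() for p in model.parameters())
print(f'parameters: {n_params / 1e6:.1f}M')

# Forward a dummy 224x224 batch; the output is ImageNet-1k class logits.
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # expected: torch.Size([1, 1000])
```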
## Core Capabilities
- Image classification with 81.512% top-1 accuracy
- Feature extraction backbone functionality
- Efficient inference with reduced computational overhead
- Suitable for both classification and embedding generation (see the sketch below)
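For the embedding-generation use above, a hedged sketch relying on timm's standard `num_classes=0` convention to drop the classifier head; the printed embedding width (512) is an assumption based on LeViT-256's last-stage dimension:

```python
import timm
import torch

# num_classes=0 removes the classification head; the forward pass then
# returns the pooled feature embedding instead of class logits.
backbone = timm.create_model('levit_256.fb_dist_in1k', pretrained=True, num_classes=0)
backbone.eval()

with torch.no_grad():
    embedding = backbone(torch.randn(1, 3, 224, 224))
print(embedding.shape)  # e.g. torch.Size([1, 512]) for LeViT-256 (assumed)
```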
## Frequently Asked Questions
### Q: What makes this model unique?
LeViT-256 stands out by incorporating convolutional operations into a transformer architecture, offering faster inference speeds while maintaining competitive accuracy. It represents a middle-ground solution in the LeViT family, balancing model size and performance.
### Q: What are the recommended use cases?
This model is ideal for production environments requiring efficient image classification or feature extraction, particularly where computational resources are constrained but high accuracy is still necessary. It's well-suited for mobile applications and real-time processing scenarios.
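As an illustration of such a deployment, here is a hedged end-to-end classification sketch using timm's standard preprocessing helpers; `example.jpg` is a placeholder path:

```python
from PIL import Image
import timm
import torch

model = timm.create_model('levit_256.fb_dist_in1k', pretrained=True)
model.eval()

# Resolve the preprocessing the checkpoint expects (resize, crop, mean/std)
# and build the matching evaluation transform.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

# 'example.jpg' is a placeholder path for illustration.
img = Image.open('example.jpg').convert('RGB')
x = transform(img).unsqueeze(0)  # shape: (1, 3, 224, 224)

with torch.no_grad():
    probs = model(x).softmax(dim=-1)

# Report the five most likely ImageNet-1k classes.
top5 = probs.topk(5)
print(top5.indices.tolist(), top5.values.tolist())
```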