LeViT-128 Vision Transformer
| Property | Value |
|---|---|
| Parameter Count | 9.21M |
| Model Type | Vision Transformer (ConvNet-style) |
| License | Apache-2.0 |
| Image Size | 224x224 |
| Top-1 Accuracy | 78.474% |
| GMACs | 0.4 |
What is levit_128.fb_dist_in1k?
LeViT-128 is a vision transformer architecture from Facebook Research that combines the strengths of transformers with those of convolutional neural networks. It is optimized for fast inference while maintaining competitive accuracy, striking a balance between model size and performance: 9.21M parameters with 78.474% top-1 accuracy on ImageNet-1k.
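As a minimal usage sketch with the timm library (under which this checkpoint is published): the image path is hypothetical, and the preprocessing helpers follow timm's standard pattern for its pretrained models.

```python
import torch
import timm
from PIL import Image

# Load the pretrained classifier (1000 ImageNet-1k classes).
model = timm.create_model('levit_128.fb_dist_in1k', pretrained=True)
model.eval()

# Resolve the preprocessing (resize, crop, normalization) the weights expect.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

img = Image.open('example.jpg')  # hypothetical local image
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape (1, 1000)

top5 = logits.softmax(dim=-1).topk(5)
print(top5.indices, top5.values)
```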
Implementation Details
The model implements a hybrid architecture that performs its computation with convolutional operations (nn.Conv2d and nn.BatchNorm2d) while retaining transformer-style attention mechanisms. It was trained with knowledge distillation on the ImageNet-1k dataset and requires only 0.4 GMACs per forward pass. Key characteristics (see the inspection sketch after this list):
- Compact activation footprint of 2.7M
- Efficient inference architecture
- Distillation-based training approach
- Convolutional-style implementation for better hardware utilization
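To see how the blocks are realized in code, a small inspection sketch that tallies module types; it instantiates the architecture only, so no weight download is needed, and the printed counts are illustrative rather than guaranteed:

```python
import timm

# Build the architecture without pretrained weights and count module types.
model = timm.create_model('levit_128.fb_dist_in1k', pretrained=False)

counts = {}
for module in model.modules():
    name = type(module).__name__
    counts[name] = counts.get(name, 0) + 1

# Which layer families dominate reveals how the attention blocks
# are actually implemented under the hood.
for name in ('Conv2d', 'BatchNorm2d', 'Linear', 'LayerNorm'):
    print(name, counts.get(name, 0))
```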
Core Capabilities
- Image classification with 1000 classes
- Feature extraction backbone functionality (see the sketch after this list)
- Efficient inference on standard hardware
- Balanced performance-to-size ratio
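For the backbone use case, timm's general `num_classes=0` convention removes the classifier head so the forward pass returns pooled features. A minimal sketch, assuming that convention applies to this architecture:

```python
import torch
import timm

# num_classes=0 is timm's convention for dropping the classifier head;
# the forward pass then returns the pooled embedding instead of logits.
backbone = timm.create_model('levit_128.fb_dist_in1k', pretrained=True, num_classes=0)
backbone.eval()

with torch.no_grad():
    features = backbone(torch.randn(1, 3, 224, 224))

print(features.shape)  # (1, embed_dim) pooled feature vector
```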
Frequently Asked Questions
Q: What makes this model unique?
LeViT-128 stands out for its hybrid approach: transformer-style attention built from convolutional operations, optimized specifically for inference speed with little loss in accuracy. The result is an excellent balance between model size and performance.
Q: What are the recommended use cases?
This model is particularly well-suited for production environments where inference speed is crucial but accuracy cannot be significantly compromised. It's ideal for real-time image classification tasks, feature extraction, and as a backbone for more complex computer vision applications.
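To sanity-check the inference-speed claim on your own hardware, a rough timing sketch; the batch size and iteration counts are arbitrary choices, and absolute numbers will vary widely by machine:

```python
import time
import torch
import timm

model = timm.create_model('levit_128.fb_dist_in1k', pretrained=False)
model.eval()
batch = torch.randn(8, 3, 224, 224)

with torch.no_grad():
    for _ in range(3):               # warm-up passes
        model(batch)
    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        model(batch)
    elapsed = time.perf_counter() - start

print(f'{elapsed / runs * 1000:.1f} ms per batch of 8 (CPU by default)')
```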