# LeViT-256 Vision Transformer
| Property | Value |
|---|---|
| Parameter Count | 18.9M |
| GMACs | 1.1 |
| Image Size | 224 x 224 |
| Top-1 Accuracy | 81.512% |
| Paper | LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference |
## What is levit_256.fb_dist_in1k?
LeViT-256 is a hybrid vision transformer that borrows design choices from convolutional neural networks to speed up inference. Developed by Facebook Research, this checkpoint was trained on ImageNet-1k with knowledge distillation (hence the `fb_dist_in1k` tag), targeting a strong balance between accuracy and efficiency.
## Implementation Details
The model implements a hybrid architecture: a convolutional stem built from `nn.Conv2d` and `nn.BatchNorm2d` layers replaces the usual transformer patch embedding, while transformer-style attention blocks handle global feature mixing. With 18.9M parameters and 1.1 GMACs, it delivers efficient inference on 224x224 images; a loading sketch follows the list below.
- Optimized architecture combining CNN and transformer elements
- Knowledge distillation training approach
- 4.2M activations for efficient processing
- Balanced parameter count for mobile-friendly deployment
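A minimal loading sketch using the `timm` library (a reasonable assumption for a timm-style checkpoint name; requires `timm` and PyTorch):

```python
import timm
import torch

# Load the pretrained, distilled ImageNet-1k checkpoint via timm.
model = timm.create_model('levit_256.fb_dist_in1k', pretrained=True)
model.eval()

# Sanity-check the reported statistics: ~18.9M parameters.
n_params = sum(p.numel() for p in model.parameters())
print(f'parameters: {n_params / 1e6:.1f}M')

# Forward a dummy 224x224 batch; the output is ImageNet-1k class logits.
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # expected: torch.Size([1, 1000])
```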
## Core Capabilities
- Image classification with 81.512% top-1 accuracy
- Feature extraction backbone functionality
- Efficient inference with reduced computational overhead
- Suitable for both classification and embedding generation (see the sketch below)
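For the embedding-generation use above, a hedged sketch relying on timm's standard `num_classes=0` convention to drop the classifier head; the printed embedding width (512) is an assumption based on LeViT-256's last-stage dimension:

```python
import timm
import torch

# num_classes=0 removes the classification head; the forward pass then
# returns the pooled feature embedding instead of class logits.
backbone = timm.create_model('levit_256.fb_dist_in1k', pretrained=True, num_classes=0)
backbone.eval()

with torch.no_grad():
    embedding = backbone(torch.randn(1, 3, 224, 224))
print(embedding.shape)  # e.g. torch.Size([1, 512]) for LeViT-256 (assumed)
```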
## Frequently Asked Questions
### Q: What makes this model unique?
LeViT-256 stands out by incorporating convolutional operations into a transformer architecture, offering faster inference speeds while maintaining competitive accuracy. It represents a middle-ground solution in the LeViT family, balancing model size and performance.
### Q: What are the recommended use cases?
This model is ideal for production environments requiring efficient image classification or feature extraction, particularly where computational resources are constrained but high accuracy is still necessary. It's well-suited for mobile applications and real-time processing scenarios.
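As an illustration of such a deployment, here is a hedged end-to-end classification sketch using timm's standard preprocessing helpers; `example.jpg` is a placeholder path:

```python
from PIL import Image
import timm
import torch

model = timm.create_model('levit_256.fb_dist_in1k', pretrained=True)
model.eval()

# Resolve the preprocessing the checkpoint expects (resize, crop, mean/std)
# and build the matching evaluation transform.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

# 'example.jpg' is a placeholder path for illustration.
img = Image.open('example.jpg').convert('RGB')
x = transform(img).unsqueeze(0)  # shape: (1, 3, 224, 224)

with torch.no_grad():
    probs = model(x).softmax(dim=-1)

# Report the five most likely ImageNet-1k classes.
top5 = probs.topk(5)
print(top5.indices.tolist(), top5.values.tolist())
```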