UniFormer Image Model
| Property | Value |
|---|---|
| License | MIT |
| Paper | View Paper |
| Architecture | Vision Transformer |
| Training Data | ImageNet |
What is uniformer_image?
UniFormer is a vision transformer that combines the strengths of convolution and self-attention in a single architecture. Developed by Sense-X, it reaches 86.3% top-1 accuracy on ImageNet-1K classification without any additional training data. The model operates at 224x224 input resolution and is released in several sizes, with the base model containing 50M parameters.
Implementation Details
The model implements a hybrid architecture that applies local MHRA (Multi-Head Relation Aggregator) in shallow layers to reduce computational cost and global MHRA in deeper layers to capture long-range token relationships. This design balances efficient local feature processing with global context understanding; a minimal sketch of the two operators follows the variant list below.
- UniFormer-S: 22M parameters, 3.6G FLOPs, 82.9% top-1 accuracy
- UniFormer-B: 50M parameters, 8.3G FLOPs, 83.8% top-1 accuracy
- Integrated convolution and self-attention mechanisms
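To make the layer split concrete, here is a minimal PyTorch sketch of the two operators. It is not the repository's actual module definition: local relation aggregation is approximated with a depthwise convolution, and global relation aggregation with standard multi-head self-attention.

```python
import torch
import torch.nn as nn

class LocalMHRA(nn.Module):
    """Illustrative stand-in for local relation aggregation: each token is
    mixed only with its spatial neighbours via a depthwise convolution."""
    def __init__(self, dim: int, kernel_size: int = 5):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        return self.dwconv(x)

class GlobalMHRA(nn.Module):
    """Illustrative stand-in for global relation aggregation: standard
    multi-head self-attention over all spatial tokens."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)             # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)

# Shallow stages use the cheap local operator; deep stages use the global one.
shallow = LocalMHRA(dim=64)
deep = GlobalMHRA(dim=512)
```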
Core Capabilities
- Image Classification (ImageNet-1K)
- Transfer Learning for downstream tasks (see the head-replacement sketch after this list)
- Object Detection (53.8 box AP on COCO)
- Semantic Segmentation (50.8 mIoU on ADE20K)
- Pose Estimation (77.4 AP on COCO)
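For the transfer-learning entry above, the usual recipe is to keep the pretrained backbone and swap the ImageNet-1K classifier for a task-specific head. The sketch below assumes the classifier attribute is named `head`, a common convention in vision transformer implementations that should be checked against the actual UniFormer code.

```python
import torch.nn as nn

def adapt_for_downstream_task(model: nn.Module, num_classes: int) -> nn.Module:
    """Replace the 1000-way ImageNet classifier with a new linear head.
    Assumes the classifier attribute is called `head` (an assumption,
    not verified against the UniFormer source)."""
    model.head = nn.Linear(model.head.in_features, num_classes)
    return model

def freeze_backbone(model: nn.Module) -> None:
    """Optionally train only the new head: freeze every parameter that is
    not part of `head`, a simple linear-probing / warm-up strategy."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("head")
```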
Frequently Asked Questions
Q: What makes this model unique?
A: UniFormer's uniqueness lies in its ability to seamlessly integrate convolution and self-attention mechanisms within a transformer architecture, providing excellent performance across various visual recognition tasks while maintaining computational efficiency.
Q: What are the recommended use cases?
A: The model is particularly well-suited for image classification tasks, but its architecture makes it versatile enough for various computer vision applications including object detection, semantic segmentation, and pose estimation. It's especially valuable when high accuracy is required without access to extensive training data beyond ImageNet.
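As an illustration of the classification use case, the snippet below shows a standard ImageNet-style inference pass. The `load_uniformer_base()` call is only a placeholder for however the checkpoint is actually obtained (for example, from the Sense-X/UniFormer repository), and the preprocessing uses the usual ImageNet statistics.

```python
import torch
from PIL import Image
from torchvision import transforms

# Standard ImageNet preprocessing at the model's 224x224 input resolution.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def classify(model: torch.nn.Module, image_path: str, topk: int = 5):
    """Run a single image through the model and return the top-k
    probabilities and class indices over the 1000 ImageNet-1K classes."""
    model.eval()
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)            # (1, 3, 224, 224)
    with torch.no_grad():
        probs = model(batch).softmax(dim=-1)
    return probs.topk(topk)

# model = load_uniformer_base()   # placeholder: substitute the real loading code
# values, indices = classify(model, "example.jpg")
```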