UniFormer Image Model
| Property | Value |
|---|---|
| License | MIT |
| Paper | UniFormer Paper |
| Training Data | ImageNet-1K |
| Model Size (Small) | 22M parameters |
What is uniformer_image?
UniFormer is a vision transformer that combines the strengths of convolution and self-attention within a unified transformer architecture. Developed by researchers at Sense-X, it reaches 86.3% top-1 accuracy on ImageNet-1K classification without requiring additional training data.
Implementation Details
The model implements a hybrid architecture that uses local MHRA (Multi-Head Relation Aggregator) in shallow layers, where aggregating only nearby tokens keeps computation cheap, and global MHRA in deeper layers to learn long-range token relationships. The architecture comes in several variants, with UniFormer-S containing 22M parameters and UniFormer-B scaling up to 50M parameters; a simplified sketch of the local/global split follows the list below.
- Supports 224x224 resolution image input
- Implements efficient local-global token mixing
- Provides multiple model sizes for different computational requirements
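The local/global split described above can be illustrated with a simplified PyTorch sketch. This is not the official implementation (the real blocks also include dynamic position embeddings, normalization, and feed-forward layers); the `LocalMHRA` and `GlobalMHRA` class names, the kernel size, and the channel widths are illustrative assumptions, with local aggregation approximated by a depthwise convolution and global aggregation by standard multi-head self-attention.

```python
import torch
import torch.nn as nn

class LocalMHRA(nn.Module):
    """Illustrative local relation aggregation for shallow stages:
    a depthwise convolution mixes each token with its spatial neighbours."""
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):  # x: (B, C, H, W)
        return self.dwconv(x)

class GlobalMHRA(nn.Module):
    """Illustrative global relation aggregation for deep stages:
    standard multi-head self-attention over all tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)  # every token attends to all others
        return out.transpose(1, 2).reshape(b, c, h, w)

# Shallow stage: cheap local mixing on a high-resolution feature map.
# Deep stage: global attention on a low-resolution feature map.
shallow_block = LocalMHRA(dim=64)
deep_block = GlobalMHRA(dim=320)

x_shallow = torch.randn(1, 64, 56, 56)
x_deep = torch.randn(1, 320, 14, 14)
print(shallow_block(x_shallow).shape)  # torch.Size([1, 64, 56, 56])
print(deep_block(x_deep).shape)        # torch.Size([1, 320, 14, 14])
```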
Core Capabilities
- Image Classification (86.3% top-1 accuracy on ImageNet-1K)
- Video Classification (82.9/84.8% on Kinetics-400/600)
- Object Detection (53.8% box AP on COCO)
- Semantic Segmentation (50.8% mIoU on ADE20K)
- Pose Estimation (77.4% AP on COCO)
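For the image-classification capability, a minimal inference sketch is shown below. It assumes the official Sense-X/UniFormer repository is importable and a pretrained checkpoint is available locally; the `models.uniformer` module path, the `uniformer_small` constructor, and the checkpoint filename are assumptions that may differ from the actual release.

```python
import torch
from PIL import Image
from torchvision import transforms

# Assumption: the Sense-X/UniFormer image-classification code is on PYTHONPATH
# and exposes a `uniformer_small` constructor; adjust the import if it differs.
from models.uniformer import uniformer_small

model = uniformer_small()
state = torch.load("uniformer_small_in1k.pth", map_location="cpu")  # hypothetical filename
model.load_state_dict(state.get("model", state))
model.eval()

# Standard 224x224 ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = model(image)  # (1, 1000) ImageNet-1K class logits
print("Predicted class index:", logits.argmax(dim=1).item())
```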
Frequently Asked Questions
Q: What makes this model unique?
UniFormer integrates convolution and self-attention within a single transformer architecture, achieving state-of-the-art performance across multiple vision tasks without requiring training data beyond ImageNet-1K.
Q: What are the recommended use cases?
The model is particularly well-suited for image classification tasks, but its architecture makes it versatile enough for a wide range of computer vision applications, including video classification, object detection, semantic segmentation, and pose estimation.