UniFormer Image Model
| Property | Value |
|---|---|
| License | MIT |
| Paper | UniFormer Paper |
| Training Data | ImageNet-1K |
| Model Size (Small) | 22M parameters |
What is uniformer_image?
UniFormer is a vision transformer that combines the strengths of convolution and self-attention within a unified transformer architecture. Developed by researchers at Sense-X, it reaches 86.3% top-1 accuracy on ImageNet-1K classification without requiring additional training data.
Implementation Details
The model implements a hybrid architecture that uses local MHRA (Multi-Head Relation Aggregator) in shallow layers, where aggregating only nearby tokens keeps computation cheap, and global MHRA in deeper layers to learn long-range token relationships. The architecture comes in several variants, with UniFormer-S containing 22M parameters and UniFormer-B scaling up to 50M parameters; a simplified sketch of the local/global split follows the list below.
- Supports 224x224 resolution image input
- Implements efficient local-global token mixing
- Provides multiple model sizes for different computational requirements
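The local/global split described above can be illustrated with a simplified PyTorch sketch. This is not the official implementation (the real blocks also include dynamic position embeddings, normalization, and feed-forward layers); the `LocalMHRA` and `GlobalMHRA` class names, the kernel size, and the channel widths are illustrative assumptions, with local aggregation approximated by a depthwise convolution and global aggregation by standard multi-head self-attention.

```python
import torch
import torch.nn as nn

class LocalMHRA(nn.Module):
    """Illustrative local relation aggregation for shallow stages:
    a depthwise convolution mixes each token with its spatial neighbours."""
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):  # x: (B, C, H, W)
        return self.dwconv(x)

class GlobalMHRA(nn.Module):
    """Illustrative global relation aggregation for deep stages:
    standard multi-head self-attention over all tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)  # every token attends to all others
        return out.transpose(1, 2).reshape(b, c, h, w)

# Shallow stage: cheap local mixing on a high-resolution feature map.
# Deep stage: global attention on a low-resolution feature map.
shallow_block = LocalMHRA(dim=64)
deep_block = GlobalMHRA(dim=320)

x_shallow = torch.randn(1, 64, 56, 56)
x_deep = torch.randn(1, 320, 14, 14)
print(shallow_block(x_shallow).shape)  # torch.Size([1, 64, 56, 56])
print(deep_block(x_deep).shape)        # torch.Size([1, 320, 14, 14])
```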
Core Capabilities
- Image Classification (86.3% top-1 accuracy on ImageNet-1K)
- Video Classification (82.9/84.8% on Kinetics-400/600)
- Object Detection (53.8% box AP on COCO)
- Semantic Segmentation (50.8% mIoU on ADE20K)
- Pose Estimation (77.4% AP on COCO)
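For the image-classification capability, a minimal inference sketch is shown below. It assumes the official Sense-X/UniFormer repository is importable and a pretrained checkpoint is available locally; the `models.uniformer` module path, the `uniformer_small` constructor, and the checkpoint filename are assumptions that may differ from the actual release.

```python
import torch
from PIL import Image
from torchvision import transforms

# Assumption: the Sense-X/UniFormer image-classification code is on PYTHONPATH
# and exposes a `uniformer_small` constructor; adjust the import if it differs.
from models.uniformer import uniformer_small

model = uniformer_small()
state = torch.load("uniformer_small_in1k.pth", map_location="cpu")  # hypothetical filename
model.load_state_dict(state.get("model", state))
model.eval()

# Standard 224x224 ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = model(image)  # (1, 1000) ImageNet-1K class logits
print("Predicted class index:", logits.argmax(dim=1).item())
```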
Frequently Asked Questions
Q: What makes this model unique?
UniFormer integrates convolution and self-attention within a single transformer architecture, achieving state-of-the-art performance across multiple vision tasks without requiring training data beyond ImageNet-1K.
Q: What are the recommended use cases?
The model is particularly well-suited for image classification tasks, but its architecture makes it versatile enough for a wide range of computer vision applications, including video classification, object detection, semantic segmentation, and pose estimation.