uniformer_image

Sense-X

UniFormer is a vision transformer that combines convolution and self-attention, reaching 86.3% top-1 accuracy on ImageNet-1K without extra training data.

Property              Value
License               MIT
Paper                 UniFormer Paper
Training Data         ImageNet
Model Size (Small)    22M parameters

What is uniformer_image?

UniFormer is a vision transformer that combines the strengths of convolution and self-attention in a single transformer-style architecture. Developed by researchers at Sense-X, it reaches 86.3% top-1 accuracy on ImageNet-1K classification without additional training data.

Implementation Details

The model uses a hybrid architecture built around MHRA (Multi-Head Relation Aggregator) modules. Local MHRA in the shallow layers relates each token only to its neighbors, which keeps computational cost low where the token count is highest. Global MHRA in the deeper layers learns relationships among all tokens. The architecture comes in several variants: UniFormer-S has 22M parameters, and UniFormer-B scales up to 50M parameters.

  • Supports 224x224 resolution image input
  • Implements efficient local-global token mixing
  • Provides multiple model sizes for different computational requirements
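
To make the local/global split described above concrete, the following minimal sketch counts how many token-to-token affinities one mixing layer would compute at each stage. It is a back-of-the-envelope illustration, not code from the UniFormer repository: the four-stage layout with strides 4, 8, 16, and 32, the 5x5 local window, the choice of two local stages, and the names `affinity_pairs` and `stage_costs` are all assumptions made for this example.

```python
# Back-of-the-envelope cost model; an illustration, not UniFormer source code.

INPUT_SIZE = 224            # input resolution listed above
STRIDES = (4, 8, 16, 32)    # ASSUMED per-stage downsampling factors
LOCAL_WINDOW = 5            # ASSUMED width of the local neighbourhood
NUM_LOCAL_STAGES = 2        # ASSUMED: first two stages local, the rest global


def affinity_pairs(side, window=None):
    """Count token-to-token affinities for one token-mixing layer.

    side:   width of the square feature map (tokens = side * side)
    window: local neighbourhood width; None means global (all-to-all) mixing
    """
    tokens = side * side
    if window is None:
        # Global mixing: every token is related to every token.
        return tokens * tokens
    # Local mixing: each token is related to at most window * window
    # neighbours (an upper bound that ignores image borders).
    return tokens * window * window


def stage_costs():
    """Return (stage, side, mode, cost_used, cost_if_global) for each stage."""
    rows = []
    for index, stride in enumerate(STRIDES):
        side = INPUT_SIZE // stride
        global_cost = affinity_pairs(side)
        if index < NUM_LOCAL_STAGES:
            mode, used_cost = "local", affinity_pairs(side, LOCAL_WINDOW)
        else:
            mode, used_cost = "global", global_cost
        rows.append((index + 1, side, mode, used_cost, global_cost))
    return rows


if __name__ == "__main__":
    for stage, side, mode, used_cost, global_cost in stage_costs():
        print(f"stage {stage}: {side}x{side} tokens, {mode} mixing, "
              f"{used_cost:,} affinities (global would need {global_cost:,})")
```

Under these assumptions the first stage holds 56 x 56 = 3,136 tokens, so global mixing would need 9,834,496 affinities against 78,400 for a 5x5 local window. By the last stage only 49 tokens remain, and all-to-all mixing costs just 2,401 affinities, which is why deferring global mixing to the deeper layers is cheap.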

Core Capabilities

  • Image Classification (86.3% top-1 accuracy on ImageNet-1K)
  • Video Classification (82.9% and 84.8% top-1 accuracy on Kinetics-400 and Kinetics-600, respectively)
  • Object Detection (53.8 box AP on COCO)
  • Semantic Segmentation (50.8% mIoU on ADE20K)
  • Pose Estimation (77.4 AP on COCO)

Frequently Asked Questions

Q: What makes this model unique?

UniFormer integrates convolution and self-attention in a single transformer-style design. With only ImageNet-1K pre-training, it reports strong results across multiple vision tasks, and its ImageNet-1K classification accuracy is achieved without any extra training data.

Q: What are the recommended use cases?

The model is particularly well-suited for image classification tasks, but its architecture makes it versatile enough for a wide range of computer vision applications, including video classification, object detection, semantic segmentation, and pose estimation.
