CrossViT-9 240 ImageNet Model

Property	Value
Parameter Count	8.55M
Image Size	240x240
License	Apache 2.0
Paper	CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
Dataset	ImageNet-1k

What is crossvit_9_240.in1k?

crossvit_9_240.in1k is the ImageNet-1k checkpoint of CrossViT-9, a vision transformer that uses a cross-attention multi-scale architecture for image classification. Developed by IBM researchers, it is a lightweight variant with 8.55M parameters, designed for 240x240 pixel inputs, and costs about 1.8 GMACs per image.

Implementation Details

The model uses a dual-branch architecture that processes each image at two scales in parallel. Cross-attention layers let the two resolution pathways exchange information, so the features of each branch are informed by the other scale, which is intended to make feature extraction more robust.
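The exchange between branches can be illustrated with a small, dependency-free sketch. It assumes the scheme described in the CrossViT paper, where the class (CLS) token of one branch acts as the query and attends over the tokens of the other branch. Everything else is a simplification made for illustration: a single head, identity projections in place of learned query/key/value matrices, and both branches sharing one embedding size. The names `cross_attend` and `softmax` are hypothetical and are not part of any library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(cls_token, other_tokens):
    """Update one branch's CLS token by attending over the other branch's tokens.

    Simplified on purpose: single head, identity projections. A real layer
    would apply learned query/key/value matrices and an output projection.
    """
    dim = len(cls_token)
    # The CLS token is the only query; keys and values are the CLS token
    # itself followed by the patch tokens of the other branch.
    keys = [cls_token] + other_tokens
    scores = [
        sum(q * k for q, k in zip(cls_token, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    attended = [
        sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)
    ]
    # Residual connection: the attended summary is added to the original token.
    updated = [c + a for c, a in zip(cls_token, attended)]
    return updated, weights


# Toy example: a 2-dimensional CLS token and two patch tokens from the other branch.
updated_cls, attn = cross_attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because only the CLS token issues a query, the cost of this step grows linearly with the number of tokens in the other branch rather than quadratically, which is consistent with the model's low compute budget.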

Multi-scale processing with cross-attention mechanism
Compact architecture with 8.55M parameters
9.5M activations per forward pass
Optimized for 240x240 resolution inputs
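As a rough back-of-the-envelope check on what the parameter count means in practice, the sketch below converts it into an approximate weight size. It covers the weights only (no activations, optimizer state, or framework overhead), and the helper name `weight_size_mb` is hypothetical.

```python
PARAM_COUNT = 8.55e6  # parameter count listed for this model

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_size_mb(param_count, dtype="float32"):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_size_mb(PARAM_COUNT, dtype):.1f} MB")
# float32: 34.2 MB
# float16: 17.1 MB
```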

Core Capabilities

Image classification over the ImageNet-1k classes
Feature extraction for downstream tasks
Low compute cost of about 1.8 GMACs per image
Support for both classification and embedding generation

Frequently Asked Questions

Q: What makes this model unique?

CrossViT-9's distinctive feature is its cross-attention mechanism, which combines features from multiple scales while keeping the parameter count compact at 8.55M. This makes it a practical choice for applications where compute or memory is limited.

Q: What are the recommended use cases?

The model is well suited to image classification, particularly when inputs are at a fixed 240x240 resolution. It can be used directly as a classifier or as a feature extractor that produces embeddings for transfer learning.
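For the transfer learning case, one common approach is linear probing: keep the pretrained backbone frozen and train only a new classifier. The sketch below assumes `timm` and `torch` are installed; `build_finetune_model` and `trainable_parameters` are hypothetical helper names, and the choice to freeze the whole backbone is an illustrative assumption rather than a recommendation from the model authors.

```python
def build_finetune_model(num_classes):
    """Load pretrained CrossViT-9 with a fresh classifier for `num_classes`."""
    import timm  # imported lazily; requires timm and torch to be installed

    # A num_classes value different from the pretrained one makes timm
    # create a newly initialized classifier of the requested size.
    model = timm.create_model(
        "crossvit_9_240.in1k", pretrained=True, num_classes=num_classes
    )

    # Freeze everything, then re-enable gradients for the classifier only.
    for param in model.parameters():
        param.requires_grad = False
    for param in model.get_classifier().parameters():
        param.requires_grad = True
    return model


def trainable_parameters(model):
    """Parameters that an optimizer should update."""
    return [p for p in model.parameters() if p.requires_grad]
```

The result of `trainable_parameters(model)` can be passed to an optimizer such as `torch.optim.AdamW`. Unfreezing the full model with a lower learning rate is the usual alternative when the target data differs substantially from ImageNet.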
