MambaVision-L3-512-21K

nvidia

MambaVision-L3-512-21K is a hybrid vision model that combines Mamba and Transformer architectures, achieving 88.1% Top-1 accuracy on ImageNet-1K with 739.6M parameters.

| Property | Value |
|---|---|
| Parameter Count | 739.6M |
| FLOPs | 489.1G |
| Resolution | 512x512 |
| Top-1 Accuracy | 88.1% |
| License | NVIDIA Source Code License-NC |

What is MambaVision-L3-512-21K?

MambaVision-L3-512-21K introduces the first hybrid computer-vision architecture to combine Mamba and Transformer blocks. Developed by NVIDIA, the model is pretrained on ImageNet-21K and fine-tuned on ImageNet-1K at a 512x512 resolution, where it achieves a state-of-the-art 88.1% Top-1 accuracy.

Implementation Details

The model implements a hierarchical architecture that uniquely combines Mamba's efficient sequence modeling with Transformer's self-attention capabilities. The implementation includes specific optimizations for visual feature modeling and strategic placement of self-attention blocks in the final layers to enhance spatial dependency capture.

  • Hierarchical architecture with 4 distinct stages
  • Custom Mamba formulation optimized for visual features
  • Integration of self-attention blocks in final layers
  • Supports flexible input resolutions
  • Provides both classification and feature extraction capabilities
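The hierarchical four-stage layout above can be sketched as a quick shape calculation. This is a minimal illustration, assuming the common hierarchical-backbone convention of a stride-4 stem followed by stride-2 downsampling between stages; the exact strides are an assumption for illustration, not a detail taken from this model card.

```python
def stage_resolutions(input_size: int = 512, num_stages: int = 4, stem_stride: int = 4):
    # Compute the spatial side length of each stage's feature map, assuming a
    # stride-4 stem and 2x downsampling between stages (assumed, not confirmed).
    sizes = []
    size = input_size // stem_stride
    for _ in range(num_stages):
        sizes.append(size)
        size //= 2
    return sizes

print(stage_resolutions(512))  # [128, 64, 32, 16] under these assumptions
```

Under these assumed strides, a 512x512 input yields 128x128, 64x64, 32x32, and 16x16 feature maps across the four stages.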

Core Capabilities

  • Image Classification with 1000 classes
  • Feature extraction with multiple output stages
  • Averaged pool features output (1568-dimensional)
  • Multi-stage feature extraction (4 stages)
  • High throughput performance
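The 1568-dimensional pooled output can be related to per-stage feature widths. The snippet below is a hypothetical back-of-the-envelope sketch that assumes the usual channel-doubling scheme of hierarchical backbones (channels double at each stage, with the pooled vector matching the final stage); that scheme is an assumption, not a documented property of this checkpoint.

```python
def stage_channels(pooled_dim: int = 1568, num_stages: int = 4):
    # Work backwards from the pooled feature dimension, assuming channels
    # double at every stage (an assumed convention, for illustration only).
    base = pooled_dim // (2 ** (num_stages - 1))
    return [base * (2 ** i) for i in range(num_stages)]

print(stage_channels(1568))  # [196, 392, 784, 1568] under this assumption
```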

Frequently Asked Questions

Q: What makes this model unique?

This model is the first to successfully combine Mamba architecture with Vision Transformers, creating a hybrid approach that leverages the strengths of both architectures. The integration of self-attention blocks at specific layers optimizes the model's ability to capture long-range spatial dependencies while maintaining computational efficiency.
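To make the long-range-dependency point concrete, here is a toy, framework-free self-attention over a flattened spatial grid. It is not MambaVision's actual block (which uses learned projections and multiple heads); it only illustrates how every final-stage token can attend to every other token across the grid. The 16x16 grid and 64-dim tokens are hypothetical example sizes.

```python
import numpy as np

def self_attention(x):
    # Toy single-head self-attention with no learned projections: every token
    # computes similarity scores against all tokens, then takes a softmax-
    # weighted average. This global mixing is what captures long-range
    # spatial dependencies.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

tokens = np.random.randn(16 * 16, 64)  # hypothetical final-stage grid, flattened
out = self_attention(tokens)
print(out.shape)  # (256, 64)
```

Because the score matrix is dense (256x256 here), attention cost grows quadratically with token count, which is why placing such blocks only in the small final stages keeps the hybrid design computationally efficient.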

Q: What are the recommended use cases?

The model is well-suited for high-resolution image classification tasks and as a backbone for various computer vision applications. It's particularly effective when used for feature extraction in downstream tasks, offering multiple stages of features with varying spatial resolutions.
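For downstream use, loading would typically go through Hugging Face `transformers` with `trust_remote_code=True`, since the architecture ships custom modeling code. The sketch below is a minimal, hedged example under that assumption; it only defines a loader and does not download anything by itself.

```python
def load_mambavision(model_id: str = "nvidia/MambaVision-L3-512-21K"):
    # Sketch of loading the checkpoint via Hugging Face transformers.
    # trust_remote_code=True is assumed to be required because the hybrid
    # Mamba/Transformer architecture is defined in repository-hosted code.
    from transformers import AutoModel  # imported lazily; needs `transformers` installed
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    model.eval()  # inference mode for classification / feature extraction
    return model
```

Calling `load_mambavision()` would fetch the weights from the Hub; for feature extraction, the model card's multi-stage outputs can then be consumed by downstream heads.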
