# MambaVision-L3-512-21K
| Property | Value |
|---|---|
| Parameter Count | 739.6M |
| FLOPs | 489.1G |
| Resolution | 512x512 |
| Top-1 Accuracy | 88.1% |
| License | NVIDIA Source Code License-NC |
## What is MambaVision-L3-512-21K?
MambaVision-L3-512-21K is a large-scale vision backbone from NVIDIA's MambaVision family, the first hybrid architecture for computer vision to combine Mamba-based state space modeling with Transformer self-attention. The model is pretrained on ImageNet-21K, fine-tuned on ImageNet-1K, and operates at 512x512 resolution, reaching 88.1% Top-1 accuracy.
## Implementation Details
The model implements a hierarchical architecture that combines Mamba's efficient sequence modeling with Transformer self-attention. It uses a Mamba formulation redesigned for visual features and places self-attention blocks in the final layers to better capture long-range spatial dependencies.
- Hierarchical architecture with 4 distinct stages
- Custom Mamba formulation optimized for visual features
- Integration of self-attention blocks in final layers
- Supports flexible input resolutions
- Provides both classification and feature extraction capabilities
## Core Capabilities
- Image classification over the 1,000 ImageNet-1K classes
- Feature extraction with multiple output stages
- Average-pooled feature output (1568-dimensional)
- Multi-stage feature extraction across 4 stages
- High inference throughput
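The capabilities above can be sketched as follows. The two-output backbone interface (a pooled feature vector plus a list of per-stage feature maps) follows NVIDIA's published MambaVision usage; the helper function and its shape checks are illustrative assumptions.

```python
# Hedged sketch: inspecting MambaVision feature-extraction outputs.
# The pooled dimension (1568 for L3-512) and 4-stage output are taken
# from this card; the loading path in __main__ is an assumption.
import torch

def describe_features(avg_pool, stage_features):
    """Summarize backbone outputs: pooled vector + per-stage maps."""
    assert avg_pool.ndim == 2          # (batch, 1568) for L3-512
    assert len(stage_features) == 4    # one map per hierarchical stage
    return {
        "pooled_dim": avg_pool.shape[-1],
        "stage_shapes": [tuple(f.shape) for f in stage_features],
    }

if __name__ == "__main__":
    from transformers import AutoModel  # hub id below is an assumption
    model = AutoModel.from_pretrained(
        "nvidia/MambaVision-L3-512-21K", trust_remote_code=True
    ).eval()
    x = torch.randn(1, 3, 512, 512)
    with torch.no_grad():
        avg_pool, features = model(x)
    print(describe_features(avg_pool, features))
```

The per-stage maps decrease in spatial resolution across the hierarchy, which is what makes them useful as multi-scale inputs to detection or segmentation heads.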
## Frequently Asked Questions
**Q: What makes this model unique?**
This model is the first to successfully combine Mamba architecture with Vision Transformers, creating a hybrid approach that leverages the strengths of both architectures. The integration of self-attention blocks at specific layers optimizes the model's ability to capture long-range spatial dependencies while maintaining computational efficiency.
**Q: What are the recommended use cases?**
The model is well-suited for high-resolution image classification tasks and as a backbone for various computer vision applications. It's particularly effective when used for feature extraction in downstream tasks, offering multiple stages of features with varying spatial resolutions.