MambaVision-L3-512-21K

Maintained by: nvidia

| Property | Value |
|----------|-------|
| Parameter Count | 739.6M |
| FLOPs | 489.1G |
| Resolution | 512x512 |
| Top-1 Accuracy | 88.1% |
| License | NVIDIA Source Code License-NC |

What is MambaVision-L3-512-21K?

MambaVision-L3-512-21K is a hybrid computer vision architecture from NVIDIA, introduced as the first to combine Mamba-based sequence modeling with Transformer self-attention. The model is pretrained on ImageNet-21K, fine-tuned on ImageNet-1K, and operates at a 512x512 resolution, reaching 88.1% Top-1 accuracy.

Implementation Details

The model implements a hierarchical architecture that combines Mamba's efficient sequence modeling with the Transformer's self-attention. The Mamba formulation is redesigned for visual feature modeling, and self-attention blocks are placed in the final layers to better capture long-range spatial dependencies.

  • Hierarchical architecture with 4 distinct stages
  • Custom Mamba formulation optimized for visual features
  • Integration of self-attention blocks in final layers
  • Supports flexible input resolutions
  • Provides both classification and feature extraction capabilities (see the classification sketch below)
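
As an illustration, the following is a minimal classification sketch using the Hugging Face transformers library. It assumes the checkpoint is available on the Hub as nvidia/MambaVision-L3-512-21K with custom remote code, and uses standard ImageNet normalization; the checkpoint's own config may define different preprocessing (mean/std, crop), so treat this as a sketch rather than the official recipe.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForImageClassification

# Assumed Hub name; loading requires trust_remote_code since MambaVision
# ships its own modeling code.
model = AutoModelForImageClassification.from_pretrained(
    "nvidia/MambaVision-L3-512-21K", trust_remote_code=True
)
model.eval()

# Standard ImageNet preprocessing at 512x512 (assumption: the model
# config may specify different mean/std or crop settings).
preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")
inputs = preprocess(image).unsqueeze(0)  # (1, 3, 512, 512)

with torch.no_grad():
    outputs = model(inputs)

logits = outputs["logits"]  # (1, 1000) ImageNet-1K class scores
print("Predicted class index:", logits.argmax(-1).item())
```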

Core Capabilities

  • Image Classification with 1000 classes
  • Feature extraction with multiple output stages (see the sketch after this list)
  • 1568-dimensional average-pooled feature output
  • Multi-stage feature extraction (4 stages)
  • High throughput performance
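
The feature-extraction path can be sketched as below. Based on the MambaVision model cards, the backbone (loaded via AutoModel rather than the classification head) is assumed to return the average-pooled features together with a list of four stage feature maps; the exact return signature is an assumption taken from those cards.

```python
import torch
from transformers import AutoModel

# Backbone-only loading, no classification head (assumed Hub name).
model = AutoModel.from_pretrained(
    "nvidia/MambaVision-L3-512-21K", trust_remote_code=True
)
model.eval()

# Flexible input resolutions are supported; 512x512 matches pretraining.
dummy = torch.randn(1, 3, 512, 512)

with torch.no_grad():
    # Assumed return: (average-pooled features, list of 4 stage features).
    avg_pool, stage_features = model(dummy)

print("Averaged pool features:", avg_pool.shape)  # expected (1, 1568)
for i, feat in enumerate(stage_features):
    # Spatial resolution decreases stage by stage.
    print(f"Stage {i} features:", feat.shape)
```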

Frequently Asked Questions

Q: What makes this model unique?

This model is the first to successfully combine Mamba architecture with Vision Transformers, creating a hybrid approach that leverages the strengths of both architectures. The integration of self-attention blocks at specific layers optimizes the model's ability to capture long-range spatial dependencies while maintaining computational efficiency.

Q: What are the recommended use cases?

The model is well-suited for high-resolution image classification tasks and as a backbone for various computer vision applications. It's particularly effective when used for feature extraction in downstream tasks, offering multiple stages of features with varying spatial resolutions.
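
For example, a hypothetical linear probe on the average-pooled backbone features (a common downstream pattern, not an official recipe) could look like this, reusing the backbone loaded in the previous sketch:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Hypothetical downstream head: freeze the MambaVision backbone and
    train only a linear classifier on its 1568-dim pooled features."""

    def __init__(self, backbone, feature_dim=1568, num_classes=10):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep pretrained weights fixed
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            # Assumed return: (avg_pool, stage_features), as above.
            avg_pool, _ = self.backbone(x)
        return self.head(avg_pool)
```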
