MambaVision-L3-512-21K

nvidia

MambaVision-L3-512-21K is a hybrid vision model that combines Mamba and Transformer architectures, achieving 88.1% Top-1 accuracy on ImageNet-1K with 739.6M parameters.

| Property | Value |
|---|---|
| Parameter Count | 739.6M |
| FLOPs | 489.1G |
| Resolution | 512x512 |
| Top-1 Accuracy | 88.1% |
| License | NVIDIA Source Code License-NC |

What is MambaVision-L3-512-21K?

MambaVision-L3-512-21K introduces the first hybrid computer-vision architecture to combine Mamba and Transformer blocks. Developed by NVIDIA, the model is pretrained on ImageNet-21K and fine-tuned on ImageNet-1K at a 512x512 resolution, where it achieves a state-of-the-art 88.1% Top-1 accuracy.

Implementation Details

The model implements a hierarchical architecture that uniquely combines Mamba's efficient sequence modeling with Transformer's self-attention capabilities. The implementation includes specific optimizations for visual feature modeling and strategic placement of self-attention blocks in the final layers to enhance spatial dependency capture.

  • Hierarchical architecture with 4 distinct stages
  • Custom Mamba formulation optimized for visual features
  • Integration of self-attention blocks in final layers
  • Supports flexible input resolutions
  • Provides both classification and feature extraction capabilities
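The hierarchical four-stage layout above can be sketched as a quick shape calculation. This is a minimal illustration, assuming the common hierarchical-backbone convention of a stride-4 stem followed by stride-2 downsampling between stages; the exact strides are an assumption for illustration, not a detail taken from this model card.

```python
def stage_resolutions(input_size: int = 512, num_stages: int = 4, stem_stride: int = 4):
    # Compute the spatial side length of each stage's feature map, assuming a
    # stride-4 stem and 2x downsampling between stages (assumed, not confirmed).
    sizes = []
    size = input_size // stem_stride
    for _ in range(num_stages):
        sizes.append(size)
        size //= 2
    return sizes

print(stage_resolutions(512))  # [128, 64, 32, 16] under these assumptions
```

Under these assumed strides, a 512x512 input yields 128x128, 64x64, 32x32, and 16x16 feature maps across the four stages.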

Core Capabilities

  • Image Classification with 1000 classes
  • Feature extraction with multiple output stages
  • Averaged pool features output (1568-dimensional)
  • Multi-stage feature extraction (4 stages)
  • High throughput performance
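The 1568-dimensional pooled output can be related to per-stage feature widths. The snippet below is a hypothetical back-of-the-envelope sketch that assumes the usual channel-doubling scheme of hierarchical backbones (channels double at each stage, with the pooled vector matching the final stage); that scheme is an assumption, not a documented property of this checkpoint.

```python
def stage_channels(pooled_dim: int = 1568, num_stages: int = 4):
    # Work backwards from the pooled feature dimension, assuming channels
    # double at every stage (an assumed convention, for illustration only).
    base = pooled_dim // (2 ** (num_stages - 1))
    return [base * (2 ** i) for i in range(num_stages)]

print(stage_channels(1568))  # [196, 392, 784, 1568] under this assumption
```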

Frequently Asked Questions

Q: What makes this model unique?

This model is the first to successfully combine Mamba architecture with Vision Transformers, creating a hybrid approach that leverages the strengths of both architectures. The integration of self-attention blocks at specific layers optimizes the model's ability to capture long-range spatial dependencies while maintaining computational efficiency.
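To make the long-range-dependency point concrete, here is a toy, framework-free self-attention over a flattened spatial grid. It is not MambaVision's actual block (which uses learned projections and multiple heads); it only illustrates how every final-stage token can attend to every other token across the grid. The 16x16 grid and 64-dim tokens are hypothetical example sizes.

```python
import numpy as np

def self_attention(x):
    # Toy single-head self-attention with no learned projections: every token
    # computes similarity scores against all tokens, then takes a softmax-
    # weighted average. This global mixing is what captures long-range
    # spatial dependencies.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

tokens = np.random.randn(16 * 16, 64)  # hypothetical final-stage grid, flattened
out = self_attention(tokens)
print(out.shape)  # (256, 64)
```

Because the score matrix is dense (256x256 here), attention cost grows quadratically with token count, which is why placing such blocks only in the small final stages keeps the hybrid design computationally efficient.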

Q: What are the recommended use cases?

The model is well-suited for high-resolution image classification tasks and as a backbone for various computer vision applications. It's particularly effective when used for feature extraction in downstream tasks, offering multiple stages of features with varying spatial resolutions.
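For downstream use, loading would typically go through Hugging Face `transformers` with `trust_remote_code=True`, since the architecture ships custom modeling code. The sketch below is a minimal, hedged example under that assumption; it only defines a loader and does not download anything by itself.

```python
def load_mambavision(model_id: str = "nvidia/MambaVision-L3-512-21K"):
    # Sketch of loading the checkpoint via Hugging Face transformers.
    # trust_remote_code=True is assumed to be required because the hybrid
    # Mamba/Transformer architecture is defined in repository-hosted code.
    from transformers import AutoModel  # imported lazily; needs `transformers` installed
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    model.eval()  # inference mode for classification / feature extraction
    return model
```

Calling `load_mambavision()` would fetch the weights from the Hub; for feature extraction, the model card's multi-stage outputs can then be consumed by downstream heads.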
