MaxViT Small TF 512
| Property | Value |
|---|---|
| Parameter Count | 69.1M |
| Input Resolution | 512x512 |
| Top-1 Accuracy | 86.10% |
| License | Apache 2.0 |
| Paper | MaxViT: Multi-Axis Vision Transformer |
What is maxvit_small_tf_512.in1k?
MaxViT Small TF 512 is a variant of the MaxViT architecture, a hybrid design that combines the strengths of convolutional neural networks and transformers. Originally trained in TensorFlow and ported to PyTorch (the "tf" in the name marks the TensorFlow-origin weights, distributed through the timm library), the model is optimized for 512x512 image inputs and offers a strong balance between model size and accuracy.
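As a minimal sketch, the model can be loaded and run through timm's standard loading pattern; the image path below is illustrative:

```python
import timm
import torch
from PIL import Image

# Load the pretrained model; weights download on first use
model = timm.create_model('maxvit_small_tf_512.in1k', pretrained=True)
model.eval()

# Build the eval-time preprocessing the model expects (resizes to 512x512)
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open('example.jpg').convert('RGB')  # illustrative path
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: [1, 1000]

top5 = logits.softmax(dim=-1).topk(5)
print(top5.indices, top5.values)
```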
Implementation Details
The model implements a hybrid architecture with 69.1M parameters and requires 67.26 GMACs per inference at 512x512. It combines MBConv blocks with multi-axis self-attention, alternating block (local window) attention and grid (sparse global) attention patterns; a short sanity check of the headline figures follows the list below.
- Achieves 86.10% top-1 accuracy on ImageNet-1K
- Reported inference throughput of 88.63 samples per second (hardware-dependent benchmark)
- Produces 383.77M activations per forward pass
- Optimized for high-resolution 512x512 input images
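A quick structural sanity check, assuming only that the timm model name above resolves; the parameter count and output shape should match the figures quoted here:

```python
import timm
import torch

# pretrained=False is enough for a structural check (no weight download)
model = timm.create_model('maxvit_small_tf_512.in1k', pretrained=False)
model.eval()

n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.1f}M parameters')  # expected ~69.1M

# Forward pass at the native 512x512 resolution
x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    out = model(x)
print(out.shape)  # torch.Size([1, 1000]), one logit per ImageNet class
```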
Core Capabilities
- Image classification with 1000 classes (ImageNet)
- Feature extraction for downstream tasks (see the sketch after this list)
- Efficient handling of high-resolution images
- Balanced performance-to-parameter ratio
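For the feature-extraction use, timm's features_only flag returns per-stage feature maps instead of classification logits; a minimal sketch:

```python
import timm
import torch

# features_only=True yields a backbone that returns intermediate feature maps
model = timm.create_model(
    'maxvit_small_tf_512.in1k',
    pretrained=True,
    features_only=True,
)
model.eval()

x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    feature_maps = model(x)

# One tensor per stage, at progressively coarser spatial resolution
for i, fmap in enumerate(feature_maps):
    print(f'stage {i}: {tuple(fmap.shape)}')
```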
Frequently Asked Questions
Q: What makes this model unique?
The combination of MBConv blocks with multi-axis (window and grid) attention lets the model attend globally at linear rather than quadratic cost in image size, balancing computational efficiency and accuracy. Its 512x512 input resolution makes it particularly suitable for applications requiring detailed image analysis.
Q: What are the recommended use cases?
The model is well-suited for high-resolution image classification tasks, transfer learning applications, and scenarios where a good balance between accuracy and computational efficiency is required. It's particularly effective for applications needing detailed feature extraction from larger images.
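For transfer learning, a common starting point is to swap in a fresh classification head and fine-tune it first; a sketch, where the 10-class target task is hypothetical:

```python
import timm
import torch

# num_classes replaces the 1000-class ImageNet head with a newly initialized one
model = timm.create_model(
    'maxvit_small_tf_512.in1k',
    pretrained=True,
    num_classes=10,  # hypothetical downstream task
)

# Freeze the backbone and train only the new classifier head to start
for p in model.parameters():
    p.requires_grad = False
for p in model.get_classifier().parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3,
)
```

Unfreezing the backbone at a lower learning rate is a typical second stage once the new head has converged.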