MaxViT Small TF 512
| Property | Value |
|---|---|
| Parameter Count | 69.1M |
| Input Resolution | 512x512 |
| Top-1 Accuracy | 86.10% |
| License | Apache 2.0 |
| Paper | MaxViT: Multi-Axis Vision Transformer |
What is maxvit_small_tf_512.in1k?
MaxViT Small TF 512 is a variant of the MaxViT architecture, a hybrid design that combines the strengths of convolutional neural networks and transformers. Originally trained in TensorFlow and ported to PyTorch (the "tf" in the name marks the TensorFlow-origin weights, distributed through the timm library), the model is optimized for 512x512 image inputs and offers a strong balance between model size and accuracy.
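As a minimal sketch, the model can be loaded and run through timm's standard loading pattern; the image path below is illustrative:

```python
import timm
import torch
from PIL import Image

# Load the pretrained model; weights download on first use
model = timm.create_model('maxvit_small_tf_512.in1k', pretrained=True)
model.eval()

# Build the eval-time preprocessing the model expects (resizes to 512x512)
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open('example.jpg').convert('RGB')  # illustrative path
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: [1, 1000]

top5 = logits.softmax(dim=-1).topk(5)
print(top5.indices, top5.values)
```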
Implementation Details
The model implements a hybrid architecture with 69.1M parameters and requires 67.26 GMACs per inference at 512x512. It combines MBConv blocks with multi-axis self-attention, alternating block (local window) attention and grid (sparse global) attention patterns; a short sanity check of the headline figures follows the list below.
- Achieves 86.10% top-1 accuracy on ImageNet-1K
- Reported inference throughput of 88.63 samples per second (hardware-dependent benchmark)
- Produces 383.77M activations per forward pass
- Optimized for high-resolution 512x512 input images
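A quick structural sanity check, assuming only that the timm model name above resolves; the parameter count and output shape should match the figures quoted here:

```python
import timm
import torch

# pretrained=False is enough for a structural check (no weight download)
model = timm.create_model('maxvit_small_tf_512.in1k', pretrained=False)
model.eval()

n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.1f}M parameters')  # expected ~69.1M

# Forward pass at the native 512x512 resolution
x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    out = model(x)
print(out.shape)  # torch.Size([1, 1000]), one logit per ImageNet class
```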
Core Capabilities
- Image classification with 1000 classes (ImageNet)
- Feature extraction for downstream tasks (see the sketch after this list)
- Efficient handling of high-resolution images
- Balanced performance-to-parameter ratio
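For the feature-extraction use, timm's features_only flag returns per-stage feature maps instead of classification logits; a minimal sketch:

```python
import timm
import torch

# features_only=True yields a backbone that returns intermediate feature maps
model = timm.create_model(
    'maxvit_small_tf_512.in1k',
    pretrained=True,
    features_only=True,
)
model.eval()

x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    feature_maps = model(x)

# One tensor per stage, at progressively coarser spatial resolution
for i, fmap in enumerate(feature_maps):
    print(f'stage {i}: {tuple(fmap.shape)}')
```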
Frequently Asked Questions
Q: What makes this model unique?
The combination of MBConv blocks with multi-axis (window and grid) attention lets the model attend globally at linear rather than quadratic cost in image size, balancing computational efficiency and accuracy. Its 512x512 input resolution makes it particularly suitable for applications requiring detailed image analysis.
Q: What are the recommended use cases?
The model is well-suited for high-resolution image classification tasks, transfer learning applications, and scenarios where a good balance between accuracy and computational efficiency is required. It's particularly effective for applications needing detailed feature extraction from larger images.
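For transfer learning, a common starting point is to swap in a fresh classification head and fine-tune it first; a sketch, where the 10-class target task is hypothetical:

```python
import timm
import torch

# num_classes replaces the 1000-class ImageNet head with a newly initialized one
model = timm.create_model(
    'maxvit_small_tf_512.in1k',
    pretrained=True,
    num_classes=10,  # hypothetical downstream task
)

# Freeze the backbone and train only the new classifier head to start
for p in model.parameters():
    p.requires_grad = False
for p in model.get_classifier().parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3,
)
```

Unfreezing the backbone at a lower learning rate is a typical second stage once the new head has converged.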