maxvit_large_tf_512.in1k

timm

Large vision transformer (about 213M parameters) that combines convolution with attention, built for 512x512 inputs and reaching 86.52% top-1 accuracy on ImageNet-1K.

Property            Value
Parameter Count     213M
Top-1 Accuracy      86.52%
License             Apache 2.0
Paper               MaxViT: Multi-Axis Vision Transformer
Input Resolution    512x512

What is maxvit_large_tf_512.in1k?

MaxViT Large is a vision transformer that combines convolution with multi-axis attention. This checkpoint was originally trained in TensorFlow on ImageNet-1K and later ported to PyTorch, and it expects 512x512-pixel input images.

Implementation Details

The model is a hybrid architecture that pairs MBConv (mobile inverted bottleneck) convolution blocks with two attention steps, window attention and grid attention. It has 212.33M parameters (the 213M quoted above, rounded) and costs 244.75 GMACs per forward pass.

  • Combines convolutional blocks with window and grid attention mechanisms
  • Features a large-scale architecture optimized for 512x512 input resolution
  • Uses LayerNorm for normalization in the attention blocks
  • Achieves 86.52% top-1 accuracy on the ImageNet-1K dataset
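
The parameter and MAC counts above translate into rough resource figures. The following is a back-of-the-envelope sketch, not a measurement. It assumes 4 bytes per fp32 parameter and 2 bytes per fp16 parameter, counts the weights only (no activations or optimizer state), and uses the common convention that one multiply-accumulate equals two floating-point operations. The function names are illustrative.

```python
# Rough resource estimates derived from the figures quoted above.
PARAMS = 212.33e6   # parameter count
GMACS = 244.75      # multiply-accumulates per forward pass, in billions


def weight_megabytes(num_params: float, bytes_per_param: int) -> float:
    """Size of the weights alone, in decimal megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


def gflops_from_gmacs(gmacs: float) -> float:
    """Convert GMACs to GFLOPs, assuming 1 MAC = 2 FLOPs."""
    return 2.0 * gmacs


print(f"fp32 weights: {weight_megabytes(PARAMS, 4):.1f} MB")
print(f"fp16 weights: {weight_megabytes(PARAMS, 2):.1f} MB")
print(f"forward pass: {gflops_from_gmacs(GMACS):.1f} GFLOPs")
```

Under these assumptions the weights take roughly 849 MB in fp32 or 425 MB in fp16, and one forward pass costs about 490 GFLOPs.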

Core Capabilities

  • High-resolution image classification (86.52% top-1 on ImageNet-1K)
  • Feature extraction for downstream computer vision tasks
  • Efficient processing of large images through multi-axis attention
  • Balanced trade-off between computational cost and accuracy
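
To make the efficiency point concrete, the sketch below counts query-key pairs for full self-attention versus partitioned attention on a square feature map. The numbers are assumptions chosen for illustration and are not stated above: a 128x128 feature map (a 512x512 input downsampled by 4) and a partition size of 16. The helper names are hypothetical.

```python
def full_attention_pairs(side: int) -> int:
    """Every token attends to every token: N^2 pairs for N = side * side."""
    n = side * side
    return n * n


def partitioned_attention_pairs(side: int, partition: int) -> int:
    """Each token attends only to the partition * partition tokens in its group.

    A MaxViT-style block does this twice (window attention, then grid
    attention), so the total is 2 * N * partition^2.
    """
    if side % partition != 0:
        raise ValueError("side must be divisible by partition")
    n = side * side
    return 2 * n * partition * partition


side, partition = 128, 16  # assumed values, for illustration only
full = full_attention_pairs(side)
part = partitioned_attention_pairs(side, partition)
print(full, part, full // part)
```

With these assumed sizes, partitioned attention evaluates 32 times fewer pairs than full attention. Its cost also grows linearly with the number of tokens rather than quadratically, because the group size stays fixed as the image grows.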

Frequently Asked Questions

Q: What makes this model unique?

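The model's distinguishing feature is multi-axis attention, introduced in the MaxViT paper. It partitions the feature map in two ways: window attention lets each token attend within a local window, while grid attention lets it attend across a sparse, evenly spaced grid that spans the whole feature map. Combined with convolution, this gives the network both local and global context.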
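
The two partitioning schemes can be illustrated on a toy feature map. This is a minimal pure-Python sketch of which token coordinates end up attending to each other, not the library's implementation; the function names and the 4x4 map with partition size 2 are chosen only for illustration.

```python
from collections import defaultdict


def window_groups(height: int, width: int, size: int) -> dict:
    """Group token coordinates into non-overlapping size x size windows (local)."""
    if height % size or width % size:
        raise ValueError("height and width must be divisible by size")
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            groups[(i // size, j // size)].append((i, j))
    return dict(groups)


def grid_groups(height: int, width: int, size: int) -> dict:
    """Group token coordinates into size x size grids strided over the map (global)."""
    if height % size or width % size:
        raise ValueError("height and width must be divisible by size")
    stride_h, stride_w = height // size, width // size
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            # Tokens sharing the same offset within each cell attend together.
            groups[(i % stride_h, j % stride_w)].append((i, j))
    return dict(groups)


print(window_groups(4, 4, 2)[(0, 0)])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(grid_groups(4, 4, 2)[(0, 0)])    # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

Each group holds the same number of tokens in both schemes. The window group is a contiguous patch, while the grid group samples the whole map at a fixed stride, which is how a single block mixes information both locally and globally.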

Q: What are the recommended use cases?

The model is particularly well-suited for high-resolution image classification tasks, computer vision applications requiring detailed feature extraction, and scenarios where processing larger images is necessary while maintaining high accuracy.
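
As a starting point for the classification and feature-extraction use cases, here is a minimal sketch using the timm library. It assumes timm, torch, and Pillow are installed and that the pretrained weights can be downloaded. The function names and any image path passed in are placeholders. Imports sit inside the functions so that defining them does not load the heavy dependencies.

```python
MODEL_NAME = "maxvit_large_tf_512.in1k"


def _load_image_tensor(model, image_path: str):
    """Preprocess an image with the transform this checkpoint expects."""
    import timm
    from PIL import Image

    config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**config, is_training=False)
    image = Image.open(image_path).convert("RGB")
    return transform(image).unsqueeze(0)  # add a batch dimension


def classify(image_path: str, top_k: int = 5):
    """Return the top_k class probabilities and their ImageNet-1K indices."""
    import timm
    import torch

    model = timm.create_model(MODEL_NAME, pretrained=True).eval()
    with torch.no_grad():
        logits = model(_load_image_tensor(model, image_path))
    return torch.topk(logits.softmax(dim=1), k=top_k)


def extract_feature_maps(image_path: str):
    """Return a list of intermediate feature maps, one per resolution level."""
    import timm
    import torch

    model = timm.create_model(MODEL_NAME, pretrained=True, features_only=True).eval()
    with torch.no_grad():
        return model(_load_image_tensor(model, image_path))
```

Calling `classify("photo.jpg")` with a hypothetical local file would return the five most likely classes. `extract_feature_maps` is the variant to reach for when the model serves as a backbone for a downstream task.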
