xcit_tiny_24_p8_384.fb_dist_in1k


Available through the timm library.

XCiT (Cross-Covariance Image Transformer) image classification model with 12.1M parameters, optimized for 384x384 images with distillation training on ImageNet-1k.

Property         Value
Parameter Count  12.1M
Image Size       384 x 384
License          Apache-2.0
Paper            XCiT: Cross-Covariance Image Transformers
GMACs            27.1

What is xcit_tiny_24_p8_384.fb_dist_in1k?

This is a specialized implementation of the Cross-Covariance Image Transformer (XCiT) architecture, designed for high-performance image classification. Developed by Facebook AI Research, it is a lightweight variant with 12.1M parameters, tuned for 384x384 pixel inputs. The model was pre-trained on ImageNet-1k using knowledge distillation to retain high accuracy at a small model size.

Implementation Details

The model utilizes the innovative XCiT architecture, which introduces cross-covariance attention mechanisms to process image data efficiently. With 27.1 GMACs and 133.0M activations, it offers a balanced trade-off between computational efficiency and model performance.

  • Efficient patch-based image processing with P8 patch size
  • 24-layer architecture optimized for 384x384 resolution
  • Distillation-based training for improved performance
  • Support for both classification and feature extraction tasks

Core Capabilities

  • Image classification on ImageNet-1k dataset
  • Feature backbone extraction for downstream tasks
  • Efficient processing of high-resolution images
  • Support for both inference and feature embedding generation

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its implementation of cross-covariance attention mechanisms, which provide efficient processing of high-resolution images while maintaining a relatively small parameter count. The distillation training approach further enhances its performance-to-size ratio.

Q: What are the recommended use cases?

The model is particularly well-suited for image classification tasks requiring high-resolution input (384x384), feature extraction for transfer learning, and scenarios where a balance between model size and performance is crucial. It's ideal for applications requiring both accuracy and computational efficiency.
