xclip-base-patch32

by Microsoft

A video-language understanding model with 197M parameters, achieving 80.4% top-1 accuracy on Kinetics-400. Built on the CLIP architecture and extended to video classification.

  • Parameter Count: 197M
  • License: MIT
  • Top-1 Accuracy: 80.4%
  • Training Dataset: Kinetics-400

What is xclip-base-patch32?

X-CLIP base-patch32 is a video-language understanding model that extends CLIP to video analysis. Developed by Microsoft, it processes 8 frames per video at 224x224 resolution, with a patch size of 32 pixels. The model is designed for video-text matching and video classification tasks and was trained on the Kinetics-400 dataset.

Implementation Details

The model employs a transformer-based architecture that processes video frames through a patch-based approach. It uses contrastive learning to understand relationships between video content and textual descriptions.

  • Processes 8 frames per video input
  • Uses 224x224 resolution for frame analysis
  • Implements 32x32 pixel patches for processing
  • Achieves 95.0% top-5 accuracy on Kinetics-400
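
The model is available through the Hugging Face transformers library as XCLIPProcessor and XCLIPModel. The sketch below shows zero-shot classification of a single clip; the random frames are placeholders for 8 frames sampled uniformly from a real video (e.g. with decord or PyAV), and the label prompts are illustrative, not prescribed by the model.

```python
# Minimal zero-shot video classification sketch with X-CLIP.
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

model_name = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)

# 8 RGB frames at 224x224 (placeholder data; in practice, sample
# 8 frames uniformly from the clip before preprocessing).
video = list(np.random.randint(0, 256, size=(8, 224, 224, 3), dtype=np.uint8))

# Candidate labels are free-form text, which is what enables zero-shot use.
labels = ["playing basketball", "cooking", "walking the dog"]
inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video has shape (num_videos, num_texts); a softmax over
# the labels turns the video-text similarity scores into probabilities.
probs = outputs.logits_per_video.softmax(dim=1)
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```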

Core Capabilities

  • Zero-shot video classification
  • Few-shot learning capabilities
  • Video-text retrieval
  • Fully supervised video classification
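
For retrieval-style use, the model also exposes its two encoders separately via get_video_features and get_text_features. The sketch below ranks candidate captions against one clip by cosine similarity of those embeddings. This is a simplified scheme (the full X-CLIP forward pass additionally conditions the text prompts on the video content), and the captions here are made up for illustration.

```python
# Simplified video-text retrieval sketch: embed video and captions
# separately, then rank captions by cosine similarity.
import numpy as np
import torch
import torch.nn.functional as F
from transformers import XCLIPProcessor, XCLIPModel

model_name = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)

# Placeholder frames; replace with 8 frames sampled from a real clip.
video = list(np.random.randint(0, 256, size=(8, 224, 224, 3), dtype=np.uint8))
captions = ["a dog catching a frisbee", "a person slicing vegetables"]

inputs = processor(text=captions, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    video_emb = model.get_video_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Cosine similarity between the single video embedding (1, D) and each
# caption embedding (N, D); broadcasting yields one score per caption.
sims = F.cosine_similarity(video_emb, text_emb)
best = sims.argmax().item()
print(f"best match: {captions[best]} (sim={sims[best]:.3f})")
```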

Frequently Asked Questions

Q: What makes this model unique?

This model uniquely extends CLIP's vision-language capabilities to video understanding, offering strong performance on video classification tasks while maintaining a relatively compact parameter count of 197M.

Q: What are the recommended use cases?

The model excels at video classification tasks, video-text matching, and can be used for both zero-shot and few-shot learning scenarios. It's particularly effective for applications requiring understanding of video content in relation to textual descriptions.
