videomae-large-finetuned-kinetics

MCG-NJU

VideoMAE large model fine-tuned on the Kinetics-400 dataset. With 304M parameters, it achieves 84.7% top-1 accuracy on video classification. Built on a masked autoencoder architecture.

| Property | Value |
|---|---|
| Parameter Count | 304M |
| License | CC-BY-NC-4.0 |
| Paper | VideoMAE Paper |
| Accuracy | 84.7% (Top-1), 96.5% (Top-5) |
| Framework | PyTorch |

What is videomae-large-finetuned-kinetics?

VideoMAE is a video classification model that extends the Masked Autoencoder (MAE) approach to video understanding. This large variant was pre-trained for 1600 epochs using self-supervised learning and then fine-tuned on the Kinetics-400 dataset. The model processes videos as sequences of fixed-size 16x16 patches and employs a Vision Transformer encoder, paired with a lightweight decoder that is used during pre-training.
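The patch layout above determines how many tokens the transformer actually sees. The sketch below computes that count under the commonly cited VideoMAE configuration (16 frames at 224x224, 16x16 spatial patches, and a tubelet depth of 2 frames per token); the tubelet depth is an assumption from the VideoMAE design, not stated in this card.

```python
# Hedged sketch: token count for one Kinetics clip, assuming
# 16 frames at 224x224, 16x16 patches, and 2-frame tubelets.
def num_video_tokens(frames=16, height=224, width=224,
                     patch_size=16, tubelet_size=2):
    """Return the number of space-time patch tokens per clip."""
    temporal = frames // tubelet_size                         # 16 / 2 = 8
    spatial = (height // patch_size) * (width // patch_size)  # 14 * 14 = 196
    return temporal * spatial

print(num_video_tokens())  # 8 * 196 = 1568 tokens per clip
```

Halving the temporal stride or growing the resolution scales this count directly, which is why the high masking ratio during pre-training matters for efficiency.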

Implementation Details

The model architecture builds upon the Vision Transformer (ViT) framework, incorporating several key technical innovations:

  • Uses a [CLS] token for classification tasks
  • Employs fixed sine/cosine position embeddings
  • Processes video inputs as 16x16 patch sequences
  • Uses an encoder-decoder design during pre-training; only the encoder is kept for fine-tuning
  • Utilizes masked autoencoding for self-supervised learning
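The masked-autoencoding step in the list above can be sketched numerically. VideoMAE pre-trains with a very high masking ratio (around 90%); the uniform random sampling below is a simplification of the paper's tube-masking scheme, used here only to show how few tokens the encoder actually processes.

```python
import numpy as np

rng = np.random.default_rng(0)

num_tokens = 1568   # space-time tokens for a 16-frame 224x224 clip
mask_ratio = 0.9    # VideoMAE masks roughly 90% of tokens

num_masked = int(num_tokens * mask_ratio)
perm = rng.permutation(num_tokens)
masked_idx = perm[:num_masked]    # targets reconstructed by the decoder
visible_idx = perm[num_masked:]   # the only tokens the encoder sees

print(len(visible_idx))  # 157 visible tokens -> ~10x fewer encoder inputs
```

This is the efficiency argument behind MAE-style pre-training: the heavy encoder runs on ~10% of the tokens, while the small decoder handles reconstruction.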

Core Capabilities

  • Video classification across 400 Kinetics categories
  • Feature extraction for downstream tasks
  • High-accuracy prediction (84.7% top-1)
  • Efficient processing of video sequences
  • Robust representation learning
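A minimal shape sketch of classification with this architecture, using the Hugging Face `transformers` VideoMAE classes. The tiny config below is an assumption chosen so the example runs without downloading the 304M-parameter checkpoint; for real inference you would load `MCG-NJU/videomae-large-finetuned-kinetics` instead.

```python
import torch
from transformers import VideoMAEConfig, VideoMAEForVideoClassification

# Deliberately tiny config so the sketch runs without the full checkpoint.
# Real usage: VideoMAEForVideoClassification.from_pretrained(
#     "MCG-NJU/videomae-large-finetuned-kinetics")
config = VideoMAEConfig(
    image_size=224, num_frames=16, patch_size=16, tubelet_size=2,
    hidden_size=64, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=128, num_labels=400,  # 400 Kinetics categories
)
model = VideoMAEForVideoClassification(config)

# Input layout: (batch, frames, channels, height, width)
video = torch.randn(1, 16, 3, 224, 224)
with torch.no_grad():
    logits = model(pixel_values=video).logits

print(logits.shape)  # torch.Size([1, 400]) -> one score per Kinetics class
```

Taking `logits.argmax(-1)` then gives the predicted Kinetics-400 class index, which the real checkpoint maps to a label via `model.config.id2label`.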

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its application of masked autoencoding to video data, achieving state-of-the-art performance while using an efficient self-supervised pre-training approach. The large parameter count (304M) and impressive accuracy metrics make it particularly suitable for complex video understanding tasks.

Q: What are the recommended use cases?

The model is primarily designed for video classification tasks, particularly within the Kinetics-400 dataset categories. It's well-suited for applications requiring high-accuracy video understanding, such as content categorization, action recognition, and video indexing systems.
