videomae-large-finetuned-kinetics

MCG-NJU

VideoMAE large model fine-tuned on the Kinetics-400 dataset. With 304M parameters, it achieves 84.7% top-1 accuracy on video classification. Built on a masked autoencoder architecture.

| Property | Value |
|---|---|
| Parameter Count | 304M |
| License | CC-BY-NC-4.0 |
| Paper | VideoMAE Paper |
| Accuracy | 84.7% (Top-1), 96.5% (Top-5) |
| Framework | PyTorch |

What is videomae-large-finetuned-kinetics?

VideoMAE is a video classification model that extends the Masked Autoencoder (MAE) approach to video understanding. This large variant was pre-trained for 1600 epochs using self-supervised learning and then fine-tuned on the Kinetics-400 dataset. The model processes videos as sequences of fixed-size 16x16 patches and employs a Vision Transformer encoder, paired with a lightweight decoder that is used during pre-training.
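The patch layout above determines how many tokens the transformer actually sees. The sketch below computes that count under the commonly cited VideoMAE configuration (16 frames at 224x224, 16x16 spatial patches, and a tubelet depth of 2 frames per token); the tubelet depth is an assumption from the VideoMAE design, not stated in this card.

```python
# Hedged sketch: token count for one Kinetics clip, assuming
# 16 frames at 224x224, 16x16 patches, and 2-frame tubelets.
def num_video_tokens(frames=16, height=224, width=224,
                     patch_size=16, tubelet_size=2):
    """Return the number of space-time patch tokens per clip."""
    temporal = frames // tubelet_size                         # 16 / 2 = 8
    spatial = (height // patch_size) * (width // patch_size)  # 14 * 14 = 196
    return temporal * spatial

print(num_video_tokens())  # 8 * 196 = 1568 tokens per clip
```

Halving the temporal stride or growing the resolution scales this count directly, which is why the high masking ratio during pre-training matters for efficiency.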

Implementation Details

The model architecture builds upon the Vision Transformer (ViT) framework, incorporating several key technical innovations:

  • Uses a [CLS] token for classification tasks
  • Employs fixed sine/cosine position embeddings
  • Processes video inputs as 16x16 patch sequences
  • Uses an encoder-decoder design during pre-training; only the encoder is kept for fine-tuning
  • Utilizes masked autoencoding for self-supervised learning
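The masked-autoencoding step in the list above can be sketched numerically. VideoMAE pre-trains with a very high masking ratio (around 90%); the uniform random sampling below is a simplification of the paper's tube-masking scheme, used here only to show how few tokens the encoder actually processes.

```python
import numpy as np

rng = np.random.default_rng(0)

num_tokens = 1568   # space-time tokens for a 16-frame 224x224 clip
mask_ratio = 0.9    # VideoMAE masks roughly 90% of tokens

num_masked = int(num_tokens * mask_ratio)
perm = rng.permutation(num_tokens)
masked_idx = perm[:num_masked]    # targets reconstructed by the decoder
visible_idx = perm[num_masked:]   # the only tokens the encoder sees

print(len(visible_idx))  # 157 visible tokens -> ~10x fewer encoder inputs
```

This is the efficiency argument behind MAE-style pre-training: the heavy encoder runs on ~10% of the tokens, while the small decoder handles reconstruction.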

Core Capabilities

  • Video classification across 400 Kinetics categories
  • Feature extraction for downstream tasks
  • High-accuracy prediction (84.7% top-1)
  • Efficient processing of video sequences
  • Robust representation learning
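A minimal shape sketch of classification with this architecture, using the Hugging Face `transformers` VideoMAE classes. The tiny config below is an assumption chosen so the example runs without downloading the 304M-parameter checkpoint; for real inference you would load `MCG-NJU/videomae-large-finetuned-kinetics` instead.

```python
import torch
from transformers import VideoMAEConfig, VideoMAEForVideoClassification

# Deliberately tiny config so the sketch runs without the full checkpoint.
# Real usage: VideoMAEForVideoClassification.from_pretrained(
#     "MCG-NJU/videomae-large-finetuned-kinetics")
config = VideoMAEConfig(
    image_size=224, num_frames=16, patch_size=16, tubelet_size=2,
    hidden_size=64, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=128, num_labels=400,  # 400 Kinetics categories
)
model = VideoMAEForVideoClassification(config)

# Input layout: (batch, frames, channels, height, width)
video = torch.randn(1, 16, 3, 224, 224)
with torch.no_grad():
    logits = model(pixel_values=video).logits

print(logits.shape)  # torch.Size([1, 400]) -> one score per Kinetics class
```

Taking `logits.argmax(-1)` then gives the predicted Kinetics-400 class index, which the real checkpoint maps to a label via `model.config.id2label`.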

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its application of masked autoencoding to video data, achieving state-of-the-art performance while using an efficient self-supervised pre-training approach. The large parameter count (304M) and impressive accuracy metrics make it particularly suitable for complex video understanding tasks.

Q: What are the recommended use cases?

The model is primarily designed for video classification tasks, particularly within the Kinetics-400 dataset categories. It's well-suited for applications requiring high-accuracy video understanding, such as content categorization, action recognition, and video indexing systems.
