videomae-base

MCG-NJU

VideoMAE base model (94.2M parameters) for self-supervised video pre-training. Uses masked autoencoding on the Kinetics-400 dataset with a Vision Transformer (ViT) architecture.

  • Parameter Count: 94.2M
  • License: CC-BY-NC-4.0
  • Paper: VideoMAE Paper
  • Framework: PyTorch
  • Tensor Type: F32

What is videomae-base?

VideoMAE-base is a self-supervised video pre-training model that extends the Masked Autoencoder (MAE) approach to video processing. Developed by MCG-NJU, this model has been pre-trained on the Kinetics-400 dataset for 1600 epochs, utilizing a Vision Transformer (ViT) architecture to process video data effectively.

Implementation Details

The model processes videos as sequences of fixed-size 16x16 patches, which are linearly embedded, and masks a large fraction of them (tube masking at a ~90% ratio in the VideoMAE paper) during pre-training. It incorporates a [CLS] token for classification tasks and uses fixed sine/cosine position embeddings. The architecture pairs a Transformer encoder with a lightweight decoder that predicts the pixel values of the masked patches.

  • Self-supervised pre-training approach
  • Vision Transformer-based architecture
  • 16x16 patch-based video processing
  • Integrated [CLS] token for classification
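The patch-based setup above implies a concrete token budget. A minimal sketch of the geometry, assuming the values stated on this card (224x224 input, 16x16 patches, 16 frames) plus the tubelet size of 2 and ~90% masking ratio used in the VideoMAE paper:

```python
# Token geometry for VideoMAE-base pre-training.
image_size = 224    # input resolution (from the model card)
patch_size = 16     # 16x16 spatial patches
num_frames = 16     # frames sampled per clip
tubelet_size = 2    # each token spans 2 consecutive frames (paper default)
mask_ratio = 0.9    # ~90% of tokens hidden during pre-training (paper default)

patches_per_frame = (image_size // patch_size) ** 2             # 14 * 14 = 196
seq_length = (num_frames // tubelet_size) * patches_per_frame   # 8 * 196 = 1568
num_masked = int(mask_ratio * seq_length)                       # 1411 tokens masked

print(patches_per_frame, seq_length, num_masked)  # 196 1568 1411
```

The high masking ratio is what makes pre-training cheap: the encoder only sees the ~10% of tokens that remain visible.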

Core Capabilities

  • Video feature extraction and representation learning
  • Masked patch prediction for self-supervised learning
  • Fine-tunable for downstream video classification tasks
  • Efficient processing of video temporal information
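Feature extraction can be sketched with the Hugging Face `transformers` VideoMAE classes. The config below is deliberately scaled down (smaller resolution and hidden size, hypothetical values chosen only so the sketch runs quickly); in practice you would load the real checkpoint with `VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")`:

```python
import torch
from transformers import VideoMAEConfig, VideoMAEModel

# Scaled-down, randomly initialized config for illustration only.
config = VideoMAEConfig(
    image_size=112, patch_size=16, num_frames=8, tubelet_size=2,
    hidden_size=192, num_hidden_layers=2, num_attention_heads=3,
    intermediate_size=768,
)
model = VideoMAEModel(config).eval()

# Dummy clip: (batch, frames, channels, height, width).
video = torch.randn(1, 8, 3, 112, 112)
with torch.no_grad():
    features = model(pixel_values=video).last_hidden_state

# (8 frames / tubelet 2) * (112/16)^2 patches = 4 * 49 = 196 tokens.
print(features.shape)
```

The resulting per-token features can be pooled (e.g. mean-pooled) into a clip-level representation for downstream tasks.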

Frequently Asked Questions

Q: What makes this model unique?

VideoMAE stands out for its efficient self-supervised approach to video understanding: it requires no manual annotations during pre-training, yet achieves strong downstream performance through masked autoencoding.

Q: What are the recommended use cases?

The model is primarily designed for video understanding tasks, particularly after fine-tuning. It's suitable for video classification, feature extraction, and can be adapted for various video analysis tasks through transfer learning.
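The pre-training objective itself (masked patch reconstruction) can be exercised with `VideoMAEForPreTraining`, which takes a boolean mask over the token sequence. Again a scaled-down, randomly initialized config is used here so the sketch is self-contained; a real run would load `"MCG-NJU/videomae-base"` via `from_pretrained`:

```python
import torch
from transformers import VideoMAEConfig, VideoMAEForPreTraining

# Scaled-down config for illustration; the real checkpoint uses 224x224 / 16 frames.
config = VideoMAEConfig(
    image_size=112, patch_size=16, num_frames=8, tubelet_size=2,
    hidden_size=192, num_hidden_layers=2, num_attention_heads=3,
    intermediate_size=768,
)
model = VideoMAEForPreTraining(config).eval()

video = torch.randn(1, 8, 3, 112, 112)  # dummy clip
seq_length = (config.num_frames // config.tubelet_size) * \
             (config.image_size // config.patch_size) ** 2  # 4 * 49 = 196 tokens

# Randomly hide ~90% of tokens, the ratio used in the VideoMAE paper.
num_masked = int(0.9 * seq_length)
mask = torch.zeros(1, seq_length, dtype=torch.bool)
mask[:, torch.randperm(seq_length)[:num_masked]] = True

with torch.no_grad():
    outputs = model(pixel_values=video, bool_masked_pos=mask)
print(outputs.loss)  # scalar reconstruction loss over the masked patches
```

For classification fine-tuning, the same encoder is typically wrapped with a classification head (`VideoMAEForVideoClassification` in `transformers`) and trained on labeled clips.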
