videomae-large

MCG-NJU

VideoMAE large: a 343M-parameter video transformer pre-trained on Kinetics-400 with self-supervised masked autoencoding

Property	Value
Parameter Count	343M
License	CC-BY-NC-4.0
Paper	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Framework	PyTorch

What is videomae-large?

VideoMAE-large is a self-supervised model for video understanding. It extends the Masked Autoencoder (MAE) approach from images to video: patches of the input video are masked, and the model is trained to predict them. The model has 343M parameters and was pre-trained on the Kinetics-400 dataset for 1600 epochs.

Implementation Details

The model takes a video as a sequence of fixed-size 16x16 patches. The encoder follows the Vision Transformer (ViT) architecture, and a decoder on top of it predicts the masked patches during pre-training. A [CLS] token is used for classification tasks, and fixed sine/cosine position embeddings are added to the sequence.
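To make the patch sequence and the position embeddings concrete, here is a minimal pure-Python sketch. Only the 16x16 patch size comes from the description above. The clip shape (16 frames at 224x224) and the temporal grouping of 2 frames per token (`tubelet`) are assumptions chosen for illustration, and `token_count` and `sincos_position_table` are hypothetical helper names, not part of any library.

```python
import math


def token_count(num_frames=16, height=224, width=224, patch=16, tubelet=2):
    """Number of tokens when each token covers tubelet x patch x patch pixels.

    All defaults except patch=16 are assumptions made for this example.
    """
    if num_frames % tubelet or height % patch or width % patch:
        raise ValueError("clip dimensions must be divisible by the patch/tubelet size")
    return (num_frames // tubelet) * (height // patch) * (width // patch)


def sincos_position_table(num_positions, dim):
    """Fixed sine/cosine table: even columns use sin, odd columns use cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Columns 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


n = token_count()
table = sincos_position_table(n, dim=8)  # dim=8 only keeps the demo small
print(n, len(table), len(table[0]))  # prints: 1568 1568 8
```

Under these assumptions a clip becomes 8 x 14 x 14 = 1568 tokens, and each token position receives one fixed row of the table. Because the table is computed rather than learned, it adds no trainable parameters.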

Large-scale architecture with 343M parameters
Self-supervised pre-training on Kinetics-400
16x16 patch-based video processing
Transformer-based encoding with specialized decoder

Core Capabilities

Masked video patch prediction
Feature extraction for downstream tasks
Video representation learning
Transfer learning potential for various video tasks
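Masked video patch prediction starts from a mask over the patch sequence. The sketch below shows one way to build such a mask in pure Python. It uses tube masking (the same spatial patches are hidden at every time step) with a 90% ratio, as described in the VideoMAE paper; neither detail is stated on this page, so treat both as assumptions. The grid size (8 temporal slots of 196 patches) and the helper name `tube_mask` are also illustrative.

```python
import random


def tube_mask(temporal_slots, spatial_patches, mask_ratio=0.9, seed=None):
    """Return a flat list of booleans (True = masked).

    The same spatial positions are masked in every temporal slot, so a
    masked patch cannot be recovered by copying it from a neighbouring frame.
    """
    rng = random.Random(seed)
    num_masked = int(mask_ratio * spatial_patches)
    masked = set(rng.sample(range(spatial_patches), num_masked))
    frame_mask = [p in masked for p in range(spatial_patches)]
    return frame_mask * temporal_slots


mask = tube_mask(temporal_slots=8, spatial_patches=196, seed=0)
print(len(mask), mask.count(False))  # prints: 1568 160
```

With these numbers only 160 of 1568 tokens stay visible to the encoder, and the decoder is asked to predict the remaining 1408.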

Frequently Asked Questions

Q: What makes this model unique?

VideoMAE-large is pre-trained with self-supervision, so pre-training requires no labeled data. Its large parameter count and encoder-decoder design let it learn video features by predicting masked content.

Q: What are the recommended use cases?

The model is intended for video understanding tasks. It can be fine-tuned for applications such as action recognition and video classification, or used as a feature extractor for downstream models. It is particularly suited to large video datasets.
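For feature extraction, a minimal sketch using the Hugging Face `transformers` library might look like the following. It assumes the checkpoint is published under the id `MCG-NJU/videomae-large` and that `torch` and `transformers` are installed. `clip_embedding` is a hypothetical helper, and averaging all output tokens is just one simple pooling choice, not necessarily how the authors produce clip-level features.

```python
def clip_embedding(frames, checkpoint="MCG-NJU/videomae-large"):
    """Return one feature vector for a clip by averaging the encoder's output tokens.

    `frames` is assumed to be a list of 16 RGB frames, for example
    uint8 NumPy arrays of shape (height, width, 3).
    """
    # Imported lazily so the heavy dependencies are only needed when called.
    import torch
    from transformers import VideoMAEImageProcessor, VideoMAEModel

    processor = VideoMAEImageProcessor.from_pretrained(checkpoint)
    model = VideoMAEModel.from_pretrained(checkpoint).eval()

    inputs = processor(frames, return_tensors="pt")  # resizes and normalizes frames
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, tokens, hidden_size)
    return hidden.mean(dim=1).squeeze(0)  # (hidden_size,)


# Example call with random frames (downloads the checkpoint on first use):
#   import numpy as np
#   frames = list(np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8))
#   features = clip_embedding(frames)
```

The resulting vector can feed a small classifier. For end-to-end fine-tuning on action recognition, the same library provides `VideoMAEForVideoClassification`, which adds a classification head on top of the encoder.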