MERT-v0-public
| Property | Value |
|---|---|
| Parameter Count | 95M |
| Model Type | Music Understanding Model |
| Architecture | 12-layer Transformer with 768-dimensional features |
| Training Data | 900 hours of open-source music |
| Sample Rate | 16 kHz |
| Feature Rate | 50 Hz |
| Paper | arXiv:2306.00107 |
What is MERT-v0-public?
MERT-v0-public is a self-supervised music audio understanding model trained exclusively on non-commercial, open-source datasets, including Music4All and a filtered version of FMA_full. It is part of the m-a-p model family and is trained with the Masked Language Model (MLM) paradigm.
Implementation Details
The model uses a 12-layer Transformer architecture with 768-dimensional feature outputs. It ingests audio at 16 kHz and generates features at 50 Hz (50 feature frames per second). It was pre-trained with 5-second context windows and outputs 13 layers of representations (the input layer plus the 12 transformer layers) that can be used for various downstream tasks; see the feature-extraction sketch after the list below.
- Pre-trained with the MLM paradigm on 900 hours of music
- Supports variable length audio inputs
- Generates layer-wise representations suitable for different tasks
- Implements efficient feature extraction and processing
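A minimal feature-extraction sketch using the Hugging Face `transformers` API is given below. The checkpoint name `m-a-p/MERT-v0-public`, the use of `Wav2Vec2FeatureExtractor`, and the input file are assumptions based on the common usage pattern for MERT checkpoints, not specifics confirmed by this card:

```python
# Sketch: extract layer-wise MERT features from an audio file.
# Assumes the checkpoint is published as "m-a-p/MERT-v0-public" on the
# Hugging Face Hub and loads its custom architecture via trust_remote_code.
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

model = AutoModel.from_pretrained("m-a-p/MERT-v0-public", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v0-public")

# Load audio, mix down to mono, and resample to the expected 16 kHz input rate.
waveform, sr = torchaudio.load("example.wav")  # hypothetical input file
waveform = waveform.mean(dim=0)
if sr != processor.sampling_rate:
    waveform = torchaudio.functional.resample(waveform, sr, processor.sampling_rate)

inputs = processor(waveform.numpy(), sampling_rate=processor.sampling_rate,
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# 13 hidden states (input layer + 12 transformer layers), each of shape
# [batch, time_steps, 768] at roughly 50 feature frames per second.
hidden_states = torch.stack(outputs.hidden_states)
print(hidden_states.shape)  # e.g. torch.Size([13, 1, T, 768])
```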
Core Capabilities
- Music audio representation learning
- Unsupervised feature extraction
- Support for downstream music understanding tasks
- Time-based and utterance-level classification
- Flexible feature aggregation options (sketched after this list)
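As a sketch of the aggregation options, the layer-wise `hidden_states` tensor from the extraction example above can be averaged over time for utterance-level tasks, or kept as a 50 Hz frame sequence for time-based tasks (the shapes follow from the stated 13-layer, 768-dimensional output):

```python
# Utterance-level representation: average each layer over the time axis.
# hidden_states: [13, 1, T, 768] from the extraction sketch above.
time_reduced = hidden_states.mean(dim=-2)  # -> [13, 1, 768]

# Time-based tasks instead keep the 50 Hz frame sequence from a chosen layer,
# e.g. the last transformer layer:
frame_features = hidden_states[-1]         # -> [1, T, 768]
```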
Frequently Asked Questions
Q: What makes this model unique?
MERT-v0-public stands out for being trained entirely on open-source, non-commercial music data, making it particularly suitable for academic and research applications. Its MLM training paradigm and layered architecture allow for flexible feature extraction at different levels of abstraction.
Q: What are the recommended use cases?
The model is well-suited for a range of music understanding tasks, including music classification, feature extraction, and analysis. It supports both time-based analysis and utterance-level classification, and individual transformer layers can be selected or combined for task-specific optimization, as sketched below.
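One common pattern for exploiting the layer-wise representations, sketched here under the same assumptions as the examples above, is a learnable weighted combination of all 13 layers via a 1x1 convolution; this is an illustrative recipe, not a prescribed one:

```python
import torch
from torch import nn

# Learnable weighting over the 13 layer outputs via a 1x1 convolution.
# time_reduced: [13, 1, 768] utterance-level features from the sketch above.
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted = aggregator(time_reduced.permute(1, 0, 2)).squeeze(1)  # -> [1, 768]
# `weighted` can now feed a task-specific classification head.
```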