MERT-v0-public

Maintained By
m-a-p

  • Parameter Count: 95M
  • Model Type: Music Understanding Model
  • Architecture: 12-layer Transformer with 768-dimensional features
  • Training Data: 900 hours of open-source music
  • Sample Rate: 16 kHz
  • Feature Rate: 50 Hz
  • Paper: arXiv:2306.00107

What is MERT-v0-public?

MERT-v0-public is a music audio understanding model trained entirely without labels, exclusively on non-commercial open-source datasets, including Music4All and a filtered version of FMA_full. It is part of the m-a-p model family and is trained with the Masked Language Model (MLM) paradigm.

Implementation Details

The model uses a transformer-based architecture with 12 layers and 768-dimensional feature outputs. It consumes audio at 16 kHz and emits features at 50 Hz (50 feature frames per second of audio). Pre-training uses 5-second context windows, and the model outputs 13 layers of representations (12 transformer layers plus the input layer) that can be used for various downstream tasks.

  • Pre-trained using MLM paradigm on 900 hours of music
  • Supports variable length audio inputs
  • Generates layer-wise representations suitable for different tasks
  • Implements efficient feature extraction and processing
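The sample rate, feature rate, and layer count above imply fixed output shapes for any clip length. The sketch below only works through that arithmetic using the numbers from the table; it is illustrative and not the model's API:

```python
SAMPLE_RATE = 16_000   # model consumes 16 kHz audio
FEATURE_RATE = 50      # feature frames emitted per second of audio
NUM_LAYERS = 13        # 12 transformer layers + the input layer
HIDDEN_DIM = 768

def expected_feature_shape(seconds: float) -> tuple[int, int, int]:
    """Shape (layers, frames, dims) of the stacked hidden states
    for a clip of the given duration, per the rates above."""
    frames = int(seconds * FEATURE_RATE)
    return (NUM_LAYERS, frames, HIDDEN_DIM)

# A 5-second pre-training context window:
print(expected_feature_shape(5.0))   # (13, 250, 768)

# That same window corresponds to 5 * 16_000 = 80_000 raw audio samples.
print(5 * SAMPLE_RATE)               # 80000
```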

Core Capabilities

  • Music audio representation learning
  • Unsupervised feature extraction
  • Support for downstream music understanding tasks
  • Time-based and utterance-level classification
  • Flexible feature aggregation options
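One common way to use the layer-wise representations and aggregation options above is a softmax-weighted sum over layers, followed by pooling over time for utterance-level tasks. The following is a minimal numpy sketch with random stand-in features, assuming weighted layer mixing and mean pooling (a usual choice, not mandated by the model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the model's stacked hidden states: 13 layers
# (12 transformer layers + input layer), 250 frames at 50 Hz, 768 dims.
hidden_states = rng.standard_normal((13, 250, 768))

# Layer aggregation: softmax-weighted sum over layers lets a downstream
# task pick its preferred level of abstraction.
layer_logits = np.zeros(13)  # learnable parameters in practice
weights = np.exp(layer_logits) / np.exp(layer_logits).sum()
time_features = np.einsum("l,ltd->td", weights, hidden_states)

# Time-based tasks use time_features directly (one vector per frame);
# utterance-level tasks mean-pool over the time axis.
utterance_embedding = time_features.mean(axis=0)

print(time_features.shape)        # (250, 768)
print(utterance_embedding.shape)  # (768,)
```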

Frequently Asked Questions

Q: What makes this model unique?

MERT-v0-public stands out for being trained entirely on open-source, non-commercial music data, making it particularly suitable for academic and research applications. Its MLM training paradigm and layered architecture allow for flexible feature extraction at different levels of abstraction.

Q: What are the recommended use cases?

The model is well-suited for various music understanding tasks, including music classification, feature extraction, and analysis. It can be used for both time-based analysis and utterance-level classification, with the ability to leverage different transformer layers for task-specific optimizations.
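For utterance-level classification, a typical setup is a linear probe: the pre-trained features stay frozen and only a single linear layer is trained. The sketch below uses random weights and a hypothetical 10-class task purely to show the forward pass; the embedding stands in for a pooled MERT representation:

```python
import numpy as np

rng = np.random.default_rng(1)

NUM_CLASSES = 10   # hypothetical, e.g. a 10-genre classification task
HIDDEN_DIM = 768   # dimensionality of the model's features

# Stand-in for a frozen, utterance-level MERT embedding.
embedding = rng.standard_normal(HIDDEN_DIM)

# The linear probe: the only trained parameters in this setup.
W = rng.standard_normal((NUM_CLASSES, HIDDEN_DIM)) * 0.01
b = np.zeros(NUM_CLASSES)

# Forward pass: logits, then softmax over classes.
logits = W @ embedding + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.shape)  # (10,)
```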
