MERT-v1-330M

Maintained By
m-a-p

Parameter Count: 330M
Architecture: Transformer (24 layers, 1024 dimensions)
Training Data: 160,000 hours of audio
Sampling Rate: 24 kHz
Feature Rate: 75 Hz
Paper: arXiv:2306.00107

What is MERT-v1-330M?

MERT-v1-330M is an advanced music understanding model developed by m-a-p, representing a significant evolution in audio processing technology. It is trained with a Masked Language Modeling (MLM) paradigm and introduces several improvements over its predecessors, including pseudo labels from 8 EnCodec codebooks and in-batch noise mixture augmentation during training.

Implementation Details

The model employs a 24-layer transformer with 1024-dimensional hidden features. It ingests audio at a 24 kHz sampling rate and outputs features at 75 Hz, making it particularly suitable for high-fidelity music understanding tasks; a minimal loading and feature-extraction sketch follows the list below.

  • Utilizes 8 codebooks from EnCodec for pseudo labels
  • Implements MLM prediction with in-batch noise mixture
  • Processes high-sampling-rate audio (24 kHz)
  • Trained on an extensive dataset (160,000 hours)
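
As a minimal sketch of feature extraction, assuming the m-a-p/MERT-v1-330M checkpoint on the Hugging Face Hub (which ships custom code and therefore needs trust_remote_code=True); the input file name and resampling step are illustrative:

```python
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# Load the checkpoint and its matching feature extractor from the Hub.
model = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained(
    "m-a-p/MERT-v1-330M", trust_remote_code=True
)

# Load a clip and resample it to the model's 24 kHz input rate.
waveform, sr = torchaudio.load("example.wav")  # hypothetical input file
waveform = waveform.mean(dim=0)  # downmix to mono
if sr != processor.sampling_rate:  # 24000 Hz
    waveform = torchaudio.functional.resample(waveform, sr, processor.sampling_rate)

inputs = processor(
    waveform.numpy(), sampling_rate=processor.sampling_rate, return_tensors="pt"
)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# 25 hidden-state tensors (embedding output + 24 transformer layers),
# each [batch, time, 1024] at roughly 75 frames per second.
hidden_states = torch.stack(outputs.hidden_states).squeeze(1)
print(hidden_states.shape)  # e.g. torch.Size([25, T, 1024])
```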

Core Capabilities

  • Music understanding and feature extraction
  • Support for music generation tasks
  • High-quality audio processing
  • Flexible feature output from different transformer layers (see the aggregation sketch below)
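
Since different layers tend to capture different musical attributes, a common (though not prescribed) way to use the per-layer outputs is a learnable weighted sum over time-averaged layer features; the module below is an illustrative sketch of that idea:

```python
import torch
import torch.nn as nn

class LayerWeightedPool(nn.Module):
    """Time-average each layer's features, then mix layers with learned weights."""

    def __init__(self, num_layers: int = 25):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [num_layers, time, dim], e.g. the stack from the
        # extraction sketch above.
        per_layer = hidden_states.mean(dim=1)               # [num_layers, dim]
        weights = torch.softmax(self.layer_weights, dim=0)  # normalized mix
        return (weights.unsqueeze(-1) * per_layer).sum(dim=0)  # [dim]

# Stand-in tensor: 25 layers, ~3 seconds of audio at 75 Hz, 1024-d features.
hidden_states = torch.randn(25, 225, 1024)
clip_embedding = LayerWeightedPool()(hidden_states)
print(clip_embedding.shape)  # torch.Size([1024])
```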

Frequently Asked Questions

Q: What makes this model unique?

MERT-v1-330M stands out for its large-scale training data (160,000 hours), higher-sampling-rate audio processing (24 kHz), and MLM paradigm with in-batch noise mixture. The model's architecture and training approach make it particularly effective for music understanding tasks.

Q: What are the recommended use cases?

The model is well-suited for music understanding tasks, audio feature extraction, and potentially music generation. Its various transformer layers can be leveraged differently depending on the specific downstream task requirements.
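
For example, extracted features can feed a small classifier head for tagging tasks; the 10-class probe below is purely hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical 10-class linear probe (e.g. genre tagging) over a 1024-d
# clip embedding such as the one produced by the pooling sketch above.
probe = nn.Sequential(nn.LayerNorm(1024), nn.Linear(1024, 10))

clip_embedding = torch.randn(1024)  # stand-in for a pooled MERT embedding
logits = probe(clip_embedding)      # shape [10]; argmax gives the class
```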
