MERT-v1-330M

Maintained By
m-a-p

Parameter Count: 330M
Architecture: Transformer (24 layers, 1024 dimensions)
Training Data: 160,000 hours of audio
Sampling Rate: 24 kHz
Feature Rate: 75 Hz
Paper: arXiv:2306.00107

What is MERT-v1-330M?

MERT-v1-330M is an advanced music understanding model developed by m-a-p, representing a significant evolution in audio processing technology. It is trained with a Masked Language Modeling (MLM) paradigm and introduces several improvements over its predecessors, including pseudo labels from 8 EnCodec codebooks and in-batch noise mixture augmentation during training.

Implementation Details

The model employs a 24-layer transformer with 1024-dimensional hidden features. It ingests audio at a 24 kHz sampling rate and outputs features at 75 Hz, making it particularly suitable for high-fidelity music understanding tasks; a minimal loading and feature-extraction sketch follows the list below.

  • Utilizes 8 codebooks from EnCodec for pseudo labels
  • Implements MLM prediction with in-batch noise mixture
  • Processes high-sampling-rate audio (24 kHz)
  • Trained on an extensive dataset (160,000 hours)
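
As a minimal sketch of feature extraction, assuming the m-a-p/MERT-v1-330M checkpoint on the Hugging Face Hub (which ships custom code and therefore needs trust_remote_code=True); the input file name and resampling step are illustrative:

```python
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# Load the checkpoint and its matching feature extractor from the Hub.
model = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained(
    "m-a-p/MERT-v1-330M", trust_remote_code=True
)

# Load a clip and resample it to the model's 24 kHz input rate.
waveform, sr = torchaudio.load("example.wav")  # hypothetical input file
waveform = waveform.mean(dim=0)  # downmix to mono
if sr != processor.sampling_rate:  # 24000 Hz
    waveform = torchaudio.functional.resample(waveform, sr, processor.sampling_rate)

inputs = processor(
    waveform.numpy(), sampling_rate=processor.sampling_rate, return_tensors="pt"
)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# 25 hidden-state tensors (embedding output + 24 transformer layers),
# each [batch, time, 1024] at roughly 75 frames per second.
hidden_states = torch.stack(outputs.hidden_states).squeeze(1)
print(hidden_states.shape)  # e.g. torch.Size([25, T, 1024])
```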

Core Capabilities

  • Music understanding and feature extraction
  • Support for music generation tasks
  • High-quality audio processing
  • Flexible feature output from different transformer layers (see the aggregation sketch below)
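
Since different layers tend to capture different musical attributes, a common (though not prescribed) way to use the per-layer outputs is a learnable weighted sum over time-averaged layer features; the module below is an illustrative sketch of that idea:

```python
import torch
import torch.nn as nn

class LayerWeightedPool(nn.Module):
    """Time-average each layer's features, then mix layers with learned weights."""

    def __init__(self, num_layers: int = 25):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [num_layers, time, dim], e.g. the stack from the
        # extraction sketch above.
        per_layer = hidden_states.mean(dim=1)               # [num_layers, dim]
        weights = torch.softmax(self.layer_weights, dim=0)  # normalized mix
        return (weights.unsqueeze(-1) * per_layer).sum(dim=0)  # [dim]

# Stand-in tensor: 25 layers, ~3 seconds of audio at 75 Hz, 1024-d features.
hidden_states = torch.randn(25, 225, 1024)
clip_embedding = LayerWeightedPool()(hidden_states)
print(clip_embedding.shape)  # torch.Size([1024])
```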

Frequently Asked Questions

Q: What makes this model unique?

MERT-v1-330M stands out for its large-scale training data (160,000 hours), higher-sampling-rate audio processing (24 kHz), and MLM paradigm with in-batch noise mixture. The model's architecture and training approach make it particularly effective for music understanding tasks.

Q: What are the recommended use cases?

The model is well-suited for music understanding tasks, audio feature extraction, and potentially music generation. Its various transformer layers can be leveraged differently depending on the specific downstream task requirements.
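
For example, extracted features can feed a small classifier head for tagging tasks; the 10-class probe below is purely hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical 10-class linear probe (e.g. genre tagging) over a 1024-d
# clip embedding such as the one produced by the pooling sketch above.
probe = nn.Sequential(nn.LayerNorm(1024), nn.Linear(1024, 10))

clip_embedding = torch.randn(1024)  # stand-in for a pooled MERT embedding
logits = probe(clip_embedding)      # shape [10]; argmax gives the class
```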
