Metis
| Property | Value |
|---|---|
| Author | amphion |
| Model URL | https://huggingface.co/amphion/Metis |
| Paper | arXiv:2502.03128 |
What is Metis?
Metis is a foundation model for unified speech generation, built with masked generative pre-training on large-scale unlabeled speech data. A single pre-trained model is fine-tuned for each of several speech generation tasks, and the paper reports that this adaptation works with fewer than 20M trainable parameters.
Implementation Details
The architecture consists of three components: a Semantic Codec that converts speech into semantic tokens, an Acoustic Codec that handles acoustic tokens and reconstructs the waveform, and a Semantic2Acoustic component that predicts acoustic tokens from semantic inputs. Metis uses two discrete speech representations, SSL tokens and acoustic tokens, and is pre-trained on 300K hours of diverse speech data.
- Masked generative pre-training approach
- Efficient fine-tuning capability for task adaptation
- Support for multiple speech generation tasks
- Compact model size with high performance
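To make the masked generative pre-training idea concrete, here is a minimal sketch of how one training example could be built from a sequence of discrete SSL tokens: some positions are hidden, and the model is asked to predict the original tokens at those positions. It is an illustration, not the Metis implementation. The cosine mask schedule, the `MASK_ID` sentinel, the toy vocabulary size, and the function names `mask_ratio` and `mask_tokens` are all assumptions introduced here.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel marking a masked position (real token ids are >= 0)


def mask_ratio(u: float) -> float:
    """Map u in [0, 1] to a mask ratio in [0, 1] with a cosine schedule (an assumption)."""
    return math.cos(0.5 * math.pi * u)


def mask_tokens(tokens, ratio, rng):
    """Hide round(ratio * len(tokens)) positions, always at least one.

    Returns the masked sequence and the sorted list of masked positions.
    """
    n = len(tokens)
    n_mask = max(1, min(n, round(ratio * n)))
    masked_positions = set(rng.sample(range(n), n_mask))
    masked = [MASK_ID if i in masked_positions else t for i, t in enumerate(tokens)]
    return masked, sorted(masked_positions)


rng = random.Random(0)

# Toy stand-in for SSL tokens; the vocabulary size 8192 is illustrative only.
ssl_tokens = [rng.randrange(8192) for _ in range(12)]

# Sample a mask ratio for this example, then build the model input and targets.
ratio = mask_ratio(rng.random())
masked, positions = mask_tokens(ssl_tokens, ratio, rng)
targets = [ssl_tokens[i] for i in positions]  # what the model would be trained to predict

print("input:  ", masked)
print("targets:", targets)
```

In a real system the masked sequence would be fed to the model together with any task-specific condition, and the loss would be computed only at the masked positions.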
Core Capabilities
- Zero-shot text-to-speech synthesis
- Voice conversion
- Target speaker extraction
- Speech enhancement
- Lip-to-speech generation
Frequently Asked Questions
Q: What makes this model unique?
Metis handles multiple speech generation tasks from a single pre-trained model. The paper reports state-of-the-art results compared with task-specific systems while requiring significantly fewer trainable parameters or less training data.
Q: What are the recommended use cases?
The model suits applications that need speech generation, enhancement, or conversion. Typical uses are text-to-speech systems, voice conversion, target speaker extraction, speech enhancement in noisy environments, and speech generation from lip movements.