Metis

Property	Value
Author	amphion
Model URL	https://huggingface.co/amphion/Metis
Paper	arXiv:2502.03128

What is Metis?

Metis is a foundation model for unified speech generation, built with masked generative pre-training on large-scale unlabeled speech data. A single pre-trained model can be adapted to multiple speech generation tasks through fine-tuning, in some settings with fewer than 20M trainable parameters.

Implementation Details

The model architecture consists of three key components: a Semantic Codec that converts speech into semantic tokens, an Acoustic Codec that produces acoustic tokens and reconstructs waveforms from them, and a Semantic2Acoustic component that predicts acoustic tokens from semantic tokens. Metis uses two discrete speech representations, SSL tokens and acoustic tokens, and is pre-trained on 300K hours of diverse speech data.
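The data flow through the three components can be sketched as below. This is a toy illustration only: the function names, the constants HOP, VOCAB and LAYERS, and the arithmetic inside each function are hypothetical placeholders standing in for neural networks, and none of them come from the Amphion code base.

```python
from typing import List

# All names and values here are hypothetical, chosen only to show the data flow.
HOP = 4        # samples covered by one token frame (toy value)
VOCAB = 1024   # toy codebook size
LAYERS = 2     # toy number of acoustic codebook layers


def semantic_encode(waveform: List[float]) -> List[int]:
    """Stand-in for the Semantic Codec: one discrete token per frame."""
    frames = [waveform[i:i + HOP] for i in range(0, len(waveform) - HOP + 1, HOP)]
    return [int(sum(abs(x) for x in f) * 100) % VOCAB for f in frames]


def semantic_to_acoustic(semantic: List[int]) -> List[List[int]]:
    """Stand-in for Semantic2Acoustic: LAYERS acoustic tokens per frame."""
    return [[(tok * (k + 1) + k) % VOCAB for k in range(LAYERS)] for tok in semantic]


def acoustic_decode(acoustic: List[List[int]]) -> List[float]:
    """Stand-in for the Acoustic Codec decoder: HOP samples per frame."""
    out: List[float] = []
    for frame in acoustic:
        level = sum(frame) / (LAYERS * VOCAB)  # toy amplitude in [0, 1)
        out.extend([level] * HOP)
    return out


def generate(waveform: List[float]) -> List[float]:
    """Chain the three stages: speech -> semantic -> acoustic -> waveform."""
    semantic = semantic_encode(waveform)
    acoustic = semantic_to_acoustic(semantic)
    return acoustic_decode(acoustic)
```

The point of the sketch is the shape of the interfaces: semantic tokens form a single sequence, acoustic tokens carry several values per frame, and only the acoustic codec touches raw waveform samples.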

Masked generative pre-training approach
Efficient fine-tuning capability for task adaptation
Support for multiple speech generation tasks
Compact model size with high performance
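As a rough illustration of masked generative training, the sketch below hides a random subset of a token sequence and records which positions a model would be asked to recover. MASK_ID, mask_tokens and the cosine masking schedule are assumptions made for this example (the schedule is borrowed from MaskGIT-style training) and may differ from what Metis actually uses.

```python
import math
import random
from typing import List, Tuple

MASK_ID = -1  # hypothetical id reserved for the [MASK] token


def mask_tokens(tokens: List[int], rng: random.Random) -> Tuple[List[int], List[int]]:
    """Mask a random subset of tokens; return (masked sequence, masked positions)."""
    if not tokens:
        return [], []
    # Assumed schedule: draw u ~ U(0, 1) and mask a cos(pi/2 * u) fraction.
    ratio = math.cos(math.pi / 2 * rng.random())
    num_masked = max(1, round(ratio * len(tokens)))  # always mask at least one
    positions = sorted(rng.sample(range(len(tokens)), num_masked))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


# Usage: a model would be trained to predict tokens[p] for each p in positions,
# given the masked sequence (plus a task-specific condition during fine-tuning).
rng = random.Random(0)
tokens = list(range(100, 120))
masked, positions = mask_tokens(tokens, rng)
```

Because no labels are needed to build these training pairs, the same procedure applies to unlabeled speech once it has been converted to discrete tokens.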

Core Capabilities

Zero-shot text-to-speech synthesis
Voice conversion
Target speaker extraction
Speech enhancement
Lip-to-speech generation

Frequently Asked Questions

Q: What makes this model unique?

Metis handles multiple speech generation tasks from a single pre-trained model. According to the paper, its fine-tuned variants match or outperform task-specific systems while using far fewer trainable parameters and less task-specific training data.

Q: What are the recommended use cases?

The model suits applications that need speech generation, enhancement, or conversion: text-to-speech systems, voice conversion, speech enhancement in noisy environments, target speaker extraction, and generating speech from lip movements.
