# MaskGCT
| Property | Value |
|---|---|
| License | CC-BY-NC-4.0 |
| Supported Languages | English, Chinese, Korean, Japanese, French, German |
| Training Dataset | Amphion/Emilia-Dataset (100K hours) |
| Paper | arXiv:2409.00750 |
## What is MaskGCT?
MaskGCT is a zero-shot text-to-speech model with a fully non-autoregressive architecture, eliminating the need for explicit alignment information between text and speech supervision. The model follows a masked generative codec transformer approach to achieve high-quality speech synthesis across multiple languages.
## Implementation Details
The model architecture consists of four main components: Semantic Codec, Acoustic Codec, MaskGCT-T2S, and MaskGCT-S2A. These components work together to convert text into high-quality speech through a series of transformations involving semantic and acoustic tokens.
- Semantic Codec: Converts speech to semantic tokens
- Acoustic Codec: Handles acoustic token conversion and waveform reconstruction
- MaskGCT-T2S: Predicts semantic tokens from text and prompt
- MaskGCT-S2A: Generates acoustic tokens from semantic tokens
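The data flow through these four components can be sketched as follows. This is an illustrative stub, not the Amphion API: the function names, token vocabulary sizes, and the eight-codebook acoustic layout are assumptions made for the example.

```python
# Illustrative sketch of the MaskGCT pipeline (NOT the real Amphion API).
# Each stage is stubbed so the tokens flowing between components are visible.

def semantic_codec_encode(prompt_wav):
    """Semantic Codec: speech -> discrete semantic tokens (fake 1024-way codebook)."""
    return [int(abs(x) * 1e6) % 1024 for x in prompt_wav]

def t2s_predict(text, prompt_semantic, target_len):
    """MaskGCT-T2S: text + prompt semantic tokens -> target semantic tokens.
    Note the target length is supplied upfront (non-autoregressive duration control)."""
    return [(len(text) + len(prompt_semantic) + i) % 1024 for i in range(target_len)]

def s2a_predict(semantic_tokens):
    """MaskGCT-S2A: semantic tokens -> multi-layer acoustic tokens (fake 8 codebooks)."""
    return [[(t + layer) % 256 for layer in range(8)] for t in semantic_tokens]

def acoustic_codec_decode(acoustic_tokens):
    """Acoustic Codec: acoustic tokens -> waveform samples (stub output)."""
    return [0.0 for _ in acoustic_tokens]

prompt_wav = [0.1, -0.2, 0.05]                     # stand-in for a voice prompt
prompt_sem = semantic_codec_encode(prompt_wav)
sem = t2s_predict("Hello world", prompt_sem, target_len=50)
ac = s2a_predict(sem)
wav = acoustic_codec_decode(ac)
```

The key structural point the stub preserves is that text never needs to be aligned to speech frames: T2S emits a semantic-token sequence of a requested length, and S2A fills in acoustic detail conditioned on it.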
## Core Capabilities
- Zero-shot text-to-speech synthesis in 6 languages
- Non-autoregressive generation for faster inference
- Support for custom duration control
- Prompt-based speech generation
- High-quality acoustic reconstruction
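The non-autoregressive generation and duration control above both come from the mask-and-predict decoding paradigm: every position starts masked, the total length is fixed upfront, and each pass commits the most confident predictions in parallel. A minimal sketch of that schedule, using a random stand-in for the transformer (the cosine masking schedule and confidence-based selection are MaskGIT-style assumptions, not code from the MaskGCT release):

```python
import math
import random

MASK = -1  # sentinel for a still-masked position

def toy_model(tokens):
    """Stand-in for the transformer: propose (token, confidence) per masked slot."""
    rng = random.Random(0)
    return {i: (rng.randrange(1024), rng.random())
            for i, t in enumerate(tokens) if t == MASK}

def nar_decode(length, steps=8):
    # Fixing `length` here is what gives non-AR models duration control.
    tokens = [MASK] * length
    for step in range(steps):
        proposals = toy_model(tokens)
        # Cosine schedule: fraction of positions left masked after this step.
        keep_masked = math.cos(math.pi / 2 * (step + 1) / steps)
        n_unmask = len(proposals) - int(keep_masked * length)
        # Commit the highest-confidence proposals in parallel.
        best = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in best[:n_unmask]:
            tokens[i] = tok
    return tokens  # fully decoded after `steps` passes, not `length` sequential ones
```

Decoding a 64-token sequence takes 8 parallel passes here, versus 64 sequential steps for an autoregressive model, which is where the faster inference comes from.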
## Frequently Asked Questions
### Q: What makes this model unique?
MaskGCT's non-autoregressive architecture and its ability to perform zero-shot text-to-speech synthesis without explicit alignment information set it apart from traditional TTS models. The model can generate high-quality speech across multiple languages using only a prompt.
### Q: What are the recommended use cases?
The model is ideal for multilingual text-to-speech applications, particularly when prompt-based voice cloning or adaptation is needed. It's especially useful in scenarios requiring fast inference times due to its non-autoregressive nature.