# MaskGCT
| Property | Value |
|---|---|
| License | CC-BY-NC-4.0 |
| Supported Languages | English, Chinese, Korean, Japanese, French, German |
| Training Dataset | Amphion/Emilia-Dataset (100K hours) |
| Paper | arXiv:2409.00750 |
## What is MaskGCT?
MaskGCT is a zero-shot text-to-speech model with a fully non-autoregressive architecture, eliminating the need for explicit alignment information between text and speech supervision. The model follows a masked generative codec transformer approach to achieve high-quality speech synthesis across multiple languages.
## Implementation Details
The model architecture consists of four main components: Semantic Codec, Acoustic Codec, MaskGCT-T2S, and MaskGCT-S2A. These components work together to convert text into high-quality speech through a series of transformations involving semantic and acoustic tokens.
- Semantic Codec: Converts speech to semantic tokens
- Acoustic Codec: Handles acoustic token conversion and waveform reconstruction
- MaskGCT-T2S: Predicts semantic tokens from text and prompt
- MaskGCT-S2A: Generates acoustic tokens from semantic tokens
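The data flow through these four components can be sketched as follows. This is an illustrative stub, not the Amphion API: the function names, token vocabulary sizes, and the eight-codebook acoustic layout are assumptions made for the example.

```python
# Illustrative sketch of the MaskGCT pipeline (NOT the real Amphion API).
# Each stage is stubbed so the tokens flowing between components are visible.

def semantic_codec_encode(prompt_wav):
    """Semantic Codec: speech -> discrete semantic tokens (fake 1024-way codebook)."""
    return [int(abs(x) * 1e6) % 1024 for x in prompt_wav]

def t2s_predict(text, prompt_semantic, target_len):
    """MaskGCT-T2S: text + prompt semantic tokens -> target semantic tokens.
    Note the target length is supplied upfront (non-autoregressive duration control)."""
    return [(len(text) + len(prompt_semantic) + i) % 1024 for i in range(target_len)]

def s2a_predict(semantic_tokens):
    """MaskGCT-S2A: semantic tokens -> multi-layer acoustic tokens (fake 8 codebooks)."""
    return [[(t + layer) % 256 for layer in range(8)] for t in semantic_tokens]

def acoustic_codec_decode(acoustic_tokens):
    """Acoustic Codec: acoustic tokens -> waveform samples (stub output)."""
    return [0.0 for _ in acoustic_tokens]

prompt_wav = [0.1, -0.2, 0.05]                     # stand-in for a voice prompt
prompt_sem = semantic_codec_encode(prompt_wav)
sem = t2s_predict("Hello world", prompt_sem, target_len=50)
ac = s2a_predict(sem)
wav = acoustic_codec_decode(ac)
```

The key structural point the stub preserves is that text never needs to be aligned to speech frames: T2S emits a semantic-token sequence of a requested length, and S2A fills in acoustic detail conditioned on it.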
## Core Capabilities
- Zero-shot text-to-speech synthesis in 6 languages
- Non-autoregressive generation for faster inference
- Support for custom duration control
- Prompt-based speech generation
- High-quality acoustic reconstruction
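The non-autoregressive generation and duration control above both come from the mask-and-predict decoding paradigm: every position starts masked, the total length is fixed upfront, and each pass commits the most confident predictions in parallel. A minimal sketch of that schedule, using a random stand-in for the transformer (the cosine masking schedule and confidence-based selection are MaskGIT-style assumptions, not code from the MaskGCT release):

```python
import math
import random

MASK = -1  # sentinel for a still-masked position

def toy_model(tokens):
    """Stand-in for the transformer: propose (token, confidence) per masked slot."""
    rng = random.Random(0)
    return {i: (rng.randrange(1024), rng.random())
            for i, t in enumerate(tokens) if t == MASK}

def nar_decode(length, steps=8):
    # Fixing `length` here is what gives non-AR models duration control.
    tokens = [MASK] * length
    for step in range(steps):
        proposals = toy_model(tokens)
        # Cosine schedule: fraction of positions left masked after this step.
        keep_masked = math.cos(math.pi / 2 * (step + 1) / steps)
        n_unmask = len(proposals) - int(keep_masked * length)
        # Commit the highest-confidence proposals in parallel.
        best = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in best[:n_unmask]:
            tokens[i] = tok
    return tokens  # fully decoded after `steps` passes, not `length` sequential ones
```

Decoding a 64-token sequence takes 8 parallel passes here, versus 64 sequential steps for an autoregressive model, which is where the faster inference comes from.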
## Frequently Asked Questions
### Q: What makes this model unique?
MaskGCT's non-autoregressive architecture and its ability to perform zero-shot text-to-speech synthesis without explicit alignment information set it apart from traditional TTS models. The model can generate high-quality speech across multiple languages using only a prompt.
### Q: What are the recommended use cases?
The model is ideal for multilingual text-to-speech applications, particularly when prompt-based voice cloning or adaptation is needed. It's especially useful in scenarios requiring fast inference times due to its non-autoregressive nature.