MaskGCT

Maintained by: amphion

License: CC-BY-NC-4.0
Supported Languages: English, Chinese, Korean, Japanese, French, German
Training Dataset: Amphion/Emilia-Dataset (100K hours)
Paper: arXiv:2409.00750

What is MaskGCT?

MaskGCT is a zero-shot text-to-speech (TTS) model with a fully non-autoregressive architecture. It does not need explicit alignment supervision between text and speech. Instead, it uses a masked generative codec transformer to synthesize high-quality speech in multiple languages.
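To make the masked generative idea concrete, here is a minimal sketch of iterative parallel decoding: start from a fully masked token sequence, predict every position at once, commit the most confident predictions, and repeat over a fixed number of steps. This illustrates the general mask-and-predict technique, not MaskGCT's actual code. The names `MASK`, `toy_predictor`, and `masked_decode`, the cosine schedule, and the step count are all assumptions, and the random predictor only stands in for a trained transformer.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a masked position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained model: one (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_decode(length, steps=8, vocab_size=16, seed=0):
    """Fill a fully masked sequence in a fixed number of parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        preds = toy_predictor(tokens, vocab_size, rng)
        # Cosine schedule (an assumption): positions left masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident predictions among still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = max(0, len(masked) - keep_masked)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


result = masked_decode(20)
```

The number of model calls equals `steps` regardless of sequence length, which is where the speed advantage over token-by-token autoregressive decoding comes from.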

Implementation Details

The model architecture consists of four main components: Semantic Codec, Acoustic Codec, MaskGCT-T2S, and MaskGCT-S2A. Text is first mapped to semantic tokens, the semantic tokens are mapped to acoustic tokens, and the acoustic tokens are decoded into a waveform.

  • Semantic Codec: Converts speech to semantic tokens
  • Acoustic Codec: Converts speech to acoustic tokens and reconstructs the waveform from them
  • MaskGCT-T2S: Predicts semantic tokens from text and prompt
  • MaskGCT-S2A: Generates acoustic tokens from semantic tokens
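The data flow between the four components can be sketched as below. This is a hypothetical outline, not the Amphion API: `Components`, `synthesize`, and every callable signature are invented for illustration, and the toy lambdas only stand in for trained models. Conditioning the S2A stage on the prompt's acoustic tokens is also an assumption made here to show where the prompt could enter.

```python
from dataclasses import dataclass
from typing import Callable, List

Wave = List[float]
Tokens = List[int]


@dataclass
class Components:
    """Hypothetical callables standing in for the four components."""
    semantic_encode: Callable[[Wave], Tokens]    # Semantic Codec
    acoustic_encode: Callable[[Wave], Tokens]    # Acoustic Codec, encoder side
    acoustic_decode: Callable[[Tokens], Wave]    # Acoustic Codec, decoder side
    t2s: Callable[[str, Tokens, int], Tokens]    # MaskGCT-T2S
    s2a: Callable[[Tokens, Tokens], Tokens]      # MaskGCT-S2A


def synthesize(text: str, prompt_wave: Wave, target_len: int, c: Components) -> Wave:
    prompt_semantic = c.semantic_encode(prompt_wave)            # speech -> semantic tokens
    prompt_acoustic = c.acoustic_encode(prompt_wave)            # speech -> acoustic tokens
    target_semantic = c.t2s(text, prompt_semantic, target_len)  # text + prompt -> semantic
    target_acoustic = c.s2a(target_semantic, prompt_acoustic)   # semantic -> acoustic
    return c.acoustic_decode(target_acoustic)                   # acoustic -> waveform


# Toy stand-ins so the sketch runs end to end; real components are neural networks.
toy = Components(
    semantic_encode=lambda w: [int(abs(x) * 10) % 8 for x in w],
    acoustic_encode=lambda w: [int(abs(x) * 100) % 32 for x in w],
    acoustic_decode=lambda toks: [t / 32.0 for t in toks],
    t2s=lambda text, prompt, n: [(len(text) + i) % 8 for i in range(n)],
    s2a=lambda sem, prompt_ac: [(s * 4) % 32 for s in sem],
)

wave = synthesize("hello", [0.1, -0.2, 0.3], 5, toy)
```

Passing the components in as callables keeps the two-stage structure visible: the T2S stage fixes the length and content of the semantic sequence, and the later stages only refine it into audio.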

Core Capabilities

  • Zero-shot text-to-speech synthesis in 6 languages
  • Non-autoregressive generation for faster inference
  • Support for custom duration control
  • Prompt-based speech generation
  • High-quality acoustic reconstruction
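One plausible way duration control works in a non-autoregressive model is that the output length is fixed before generation starts, so choosing a duration amounts to choosing a token count. The sketch below shows that arithmetic under stated assumptions: the 50 tokens-per-second rate, the function names, and the text-length heuristic for a default duration are all hypothetical and not taken from the MaskGCT code.

```python
def target_token_count(duration_s: float, tokens_per_second: int = 50) -> int:
    """Convert a requested duration into a token sequence length.

    tokens_per_second=50 is an assumed frame rate, not a documented value.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return max(1, round(duration_s * tokens_per_second))


def estimate_duration(prompt_duration_s: float, prompt_text: str, target_text: str) -> float:
    """Hypothetical default: scale the prompt's duration by the text length ratio."""
    if not prompt_text:
        raise ValueError("prompt_text must not be empty")
    return prompt_duration_s * len(target_text) / len(prompt_text)


# A 3-second prompt with 3 characters of text, and a target text twice as long.
default_s = estimate_duration(3.0, "abc", "abcdef")
n_tokens = target_token_count(default_s)
```

A caller who wants slower or faster speech would pass a longer or shorter `duration_s` instead of the estimated default.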

Frequently Asked Questions

Q: What makes this model unique?

MaskGCT's non-autoregressive architecture and its ability to perform zero-shot text-to-speech synthesis without explicit alignment information set it apart from traditional TTS models. Given only the target text and a speech prompt, the model can generate high-quality speech in any of its supported languages.

Q: What are the recommended use cases?

The model is well suited to multilingual text-to-speech applications, particularly when prompt-based voice cloning or adaptation is needed. Because generation is non-autoregressive, it is also a good fit for scenarios that require fast inference.
