Kokoro-82M-v1.1-zh

Maintained by: hexgrad

Property          Value
Parameter Count   82 million
Model Type        Text-to-Speech (TTS)
Architecture      StyleTTS 2 + ISTFTNet
License           Apache
Languages         English, Chinese
Training Cost     120 GPU hours ($110)
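
As a quick sanity check on the training-cost figure, the implied hourly rate can be worked out from the two numbers in the table. This is only arithmetic on the stated values; the card does not say which GPU type or provider was used.

```python
# Figures taken from the property table above.
GPU_HOURS = 120
TOTAL_COST_USD = 110

# Implied average price of one GPU hour.
cost_per_gpu_hour = TOTAL_COST_USD / GPU_HOURS

print(f"Implied rate: ${cost_per_gpu_hour:.2f} per GPU hour")
```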

What is Kokoro-82M-v1.1-zh?

Kokoro-82M-v1.1-zh is an open-weight TTS model developed by hexgrad for English and Chinese speech synthesis. It incorporates 100 professional Chinese speakers from LongMaoData and three synthetic English voices. Built on the StyleTTS 2 architecture, it keeps the parameter count at 82 million, which is small for a multilingual TTS model.

Implementation Details

The model uses a decoder-only design that combines StyleTTS 2 with the ISTFTNet vocoder. It runs without diffusion or encoder components, which keeps inference computationally cheap. Training used over 100 hours of voice data, balanced between Chinese and English sources.

  • Uses the StyleTTS 2 architecture (arXiv:2306.07691)
  • Uses ISTFTNet for waveform generation (arXiv:2203.02395)
  • Trained on a professional Chinese dataset and synthetic English voices
  • Keeps the parameter count small (82M) while maintaining quality
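
To put the 82M parameter count in concrete terms, the sketch below estimates how much memory the weights alone would take at a few common numeric precisions. It is a rough estimate only: the card does not state which precision the released weights use, and the figures exclude activations, voice data, and runtime overhead.

```python
# Parameter count from the model card.
PARAM_COUNT = 82_000_000

# Bytes needed to store one parameter at each precision.
# Which precision the released weights use is not stated in the card.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(param_count: int, dtype: str) -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_mb(PARAM_COUNT, dtype):.0f} MB")
```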

Core Capabilities

  • Multilingual TTS support for English and Chinese
  • 103 voices in total: 100 Chinese and 3 English
  • Professional-quality Chinese voice synthesis
  • Three English voices: Maple and Sol (American) and Vale (British)
  • Efficient inference with decoder-only architecture

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its efficient architecture while supporting both Chinese and English TTS with a large variety of voices. It's particularly notable for incorporating professional Chinese voice data and synthetic English voices in a compact 82M parameter model.

Q: What are the recommended use cases?

The model is ideal for applications requiring high-quality multilingual TTS, particularly those needing Chinese language support. It's suitable for both personal and commercial projects under the Apache license, and its efficient architecture makes it viable for deployment in resource-constrained environments.
