BreezyVoice

MediaTek-Research

BreezyVoice is a text-to-speech system specialized for Taiwanese Mandarin, with support for code-switching and polyphone disambiguation through 注音 (bopomofo) input.

Property	Value
Author	MediaTek-Research
Paper	arXiv:2501.17790
Model Type	Text-to-Speech (TTS)
Primary Focus	Taiwanese Mandarin Voice Synthesis

What is BreezyVoice?

BreezyVoice is a text-to-speech system designed specifically for Taiwanese Mandarin. It supports voice cloning and uses 注音 (bopomofo) input to disambiguate polyphones, that is, characters with more than one possible reading. It is built on the CosyVoice architecture and targets code-switching scenarios, where a single utterance mixes languages, as well as natural-sounding speech synthesis.

Implementation Details

The model is available through GitHub, or the repository can be cloned using Git LFS. It works with local paths, and inference is run through the single_inference.py script.

Advanced polyphone disambiguation system
Voice cloning capabilities
Integration with 注音 (bopomofo) input system
Built on CosyVoice architecture
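
As a rough illustration of the local deployment path described above, the following Python sketch assembles a command line for single_inference.py without executing it. Only the script name comes from this page. The flag names, the example paths, and the helper build_inference_command are assumptions made for the sketch and should be checked against the script's own --help output before use.

```python
import shlex
import sys
from pathlib import Path


def build_inference_command(
    repo_dir: str,
    text: str,
    prompt_audio: str,
    prompt_transcript: str,
    output_path: str,
) -> list[str]:
    """Return the argv list for one synthesis run; nothing is executed here."""
    script = Path(repo_dir) / "single_inference.py"
    return [
        sys.executable,
        str(script),
        # Flag names below are assumptions; confirm them with --help.
        "--content_to_synthesize", text,
        "--speaker_prompt_audio_path", prompt_audio,
        "--speaker_prompt_text_transcription", prompt_transcript,
        "--output_path", output_path,
    ]


if __name__ == "__main__":
    # All paths and texts here are placeholders, not files shipped with the model.
    cmd = build_inference_command(
        repo_dir="BreezyVoice",
        text="歡迎使用 BreezyVoice。",
        prompt_audio="data/prompt.wav",
        prompt_transcript="這是一段參考語音的逐字稿。",
        output_path="results/output.wav",
    )
    # Print a shell-quoted form for inspection.
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the Chinese text safe from shell quoting problems and makes the command easy to inspect first.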

Core Capabilities

Superior performance in code-switching scenarios
Excellent handling of general words (8/10 rating)
Strong performance with entities (9/10 rating)
Effective abbreviation processing (9/10 rating)
Natural full sentence synthesis (7/10 rating)

Frequently Asked Questions

Q: What makes this model unique?

BreezyVoice stands out for its focus on Taiwanese Mandarin and its performance in code-switching scenarios, especially on entities and abbreviations. Its integration with the 注音 (bopomofo) input system helps it pronounce ambiguous characters accurately.

Q: What are the recommended use cases?

The model is well suited to applications that need high-quality Taiwanese Mandarin speech synthesis, especially where the text involves code-switching, entity names, or abbreviations. It is also useful where pronunciation must be controlled precisely through bopomofo input.
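
To make the idea of pronunciation control through bopomofo concrete, here is a minimal Python sketch that attaches a reading to selected characters in a sentence. The bracket notation and the annotate helper are placeholders chosen for this illustration, not necessarily the syntax BreezyVoice parses, so consult the repository for the exact input format.

```python
# Hypothetical inline notation: the annotated character is followed by
# "[:<bopomofo>]". Check the repository for the syntax the model expects.


def annotate(text: str, readings: dict[int, str]) -> str:
    """Attach a bopomofo reading to the characters at the given indices."""
    for index in readings:
        if not 0 <= index < len(text):
            raise IndexError(f"no character at index {index}")
    pieces = []
    for index, char in enumerate(text):
        pieces.append(char)
        if index in readings:
            pieces.append(f"[:{readings[index]}]")
    return "".join(pieces)


# 行 is a polyphone: ㄏㄤˊ in 銀行 (bank), ㄒㄧㄥˊ in 行走 (to walk).
sentence = "他在銀行前行走"
annotated = annotate(sentence, {3: "ㄏㄤˊ", 5: "ㄒㄧㄥˊ"})
print(annotated)  # prints: 他在銀行[:ㄏㄤˊ]前行[:ㄒㄧㄥˊ]走
```

The same character 行 receives two different readings in one sentence, which is the ambiguity that explicit bopomofo input is meant to resolve.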