fish-speech-1.4

fishaudio

Multilingual TTS model trained on 700k hours across 8 languages, with emphasis on English/Chinese (300k hours each). Non-commercial use only.

Property	Value
License	CC-BY-NC-SA-4.0
Research Paper	arXiv:2411.01156
Languages Supported	8 (English, Chinese, German, Japanese, French, Spanish, Korean, Arabic)
Training Data Size	700,000 hours

What is fish-speech-1.4?

Fish Speech V1.4 is a multilingual text-to-speech (TTS) model trained on roughly 700,000 hours of audio across eight languages. It builds on large language model techniques for multilingual speech synthesis.

Implementation Details

Training concentrated on English and Chinese, with approximately 300,000 hours of audio for each. The remaining six languages (German, Japanese, French, Spanish, Korean, and Arabic) have around 20,000 hours each. These per-language figures are rounded: they sum to 720,000 hours, slightly more than the 700,000-hour total quoted above.

Primary language support: English and Chinese (300k hours each)
Secondary language support: 20k hours each for German, Japanese, French, Spanish, Korean, and Arabic
Implementation available on GitHub, with a demo on the Fish Audio platform
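The split described above can be checked with a little arithmetic. The following is a minimal sketch that uses only the approximate hour counts quoted in this card; the names `HOURS_PER_LANGUAGE` and `language_shares` are illustrative and do not come from the Fish Speech codebase.

```python
# Approximate training hours per language, as quoted in this card.
# These are rounded figures, not exact dataset statistics.
HOURS_PER_LANGUAGE = {
    "English": 300_000,
    "Chinese": 300_000,
    "German": 20_000,
    "Japanese": 20_000,
    "French": 20_000,
    "Spanish": 20_000,
    "Korean": 20_000,
    "Arabic": 20_000,
}


def language_shares(hours):
    """Return each language's fraction of the summed hours."""
    total = sum(hours.values())
    return {lang: h / total for lang, h in hours.items()}


total_hours = sum(HOURS_PER_LANGUAGE.values())
shares = language_shares(HOURS_PER_LANGUAGE)

if __name__ == "__main__":
    print(f"Sum of per-language figures: {total_hours:,} hours")
    for lang, share in shares.items():
        print(f"{lang:<9} {share:6.1%}")
```

On these rounded figures, English and Chinese together account for about 83% of the audio, and each of the other six languages for a little under 3%.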

Core Capabilities

High-quality speech synthesis in 8 different languages
Advanced multilingual text processing
Balanced performance across various language pairs
Research-focused architecture leveraging LLM technologies

Frequently Asked Questions

Q: What makes this model unique?

Two things set the model apart: the scale of its training data (700k hours) and its deep coverage of English and Chinese. Its use of large language models for text-to-speech synthesis is also a comparatively new approach in multilingual TTS systems.

Q: What are the recommended use cases?

The model is suited to research and non-commercial applications that need multilingual speech synthesis; the CC-BY-NC-SA-4.0 license rules out commercial use. It is particularly well suited to English or Chinese speech synthesis, given the much larger share of training data in these languages.
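An application that routes text to this model may want to check the requested language first. The sketch below is hypothetical: the `LANGUAGE_TIERS` table and `training_tier` helper are not part of Fish Speech. It assumes ISO 639-1 language codes, and the "primary"/"secondary" labels simply mirror the training-data split described in this card.

```python
# Hypothetical lookup table built from the language list in this card.
# Keys are ISO 639-1 codes; the tier labels are this sketch's own naming.
LANGUAGE_TIERS = {
    "en": "primary",    # English, ~300k hours
    "zh": "primary",    # Chinese, ~300k hours
    "de": "secondary",  # German, ~20k hours
    "ja": "secondary",  # Japanese, ~20k hours
    "fr": "secondary",  # French, ~20k hours
    "es": "secondary",  # Spanish, ~20k hours
    "ko": "secondary",  # Korean, ~20k hours
    "ar": "secondary",  # Arabic, ~20k hours
}


def training_tier(lang_code):
    """Return 'primary' or 'secondary' for a supported language code.

    Raises ValueError for languages the card does not list.
    """
    code = lang_code.strip().lower()
    if code not in LANGUAGE_TIERS:
        supported = ", ".join(sorted(LANGUAGE_TIERS))
        raise ValueError(
            f"Unsupported language {lang_code!r}; supported codes: {supported}"
        )
    return LANGUAGE_TIERS[code]
```

A caller could use the returned tier to decide, for example, whether to review synthesized output more closely for the languages with less training data.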