fish-agent-v0.1-3b

fishaudio

A 3B-parameter voice-to-voice model supporting 8 languages, trained on about 700,000 hours of data, with a semantic-token-free architecture.

Property	Value
Model Size	3B parameters
License	CC-BY-NC-SA-4.0
Languages Supported	8 (English, Chinese, German, Japanese, French, Spanish, Korean, Arabic)
Training Data	700,000 hours
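
As a rough guide to what the 3B parameter count implies for hardware, the sketch below estimates the memory needed to hold the weights alone at a few common precisions. It assumes "3B" means exactly 3 billion parameters and ignores activations, caches, and framework overhead; the function and variable names are illustrative and not part of any Fish Audio tooling.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: "3B" is taken as exactly 3 billion parameters.
NUM_PARAMS = 3e9

# Storage size per parameter for a few common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2, "int8": 1}

estimates = {
    dtype: weight_memory_gib(NUM_PARAMS, size)
    for dtype, size in BYTES_PER_PARAM.items()
}

for dtype, gib in estimates.items():
    print(f"{dtype:>17}: {gib:.1f} GiB")
```

Actual memory use at inference time will be higher than these weight-only figures.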

What is fish-agent-v0.1-3b?

Fish Agent V0.1 3B is a Voice-to-Voice model built on Qwen-2.5-3B-Instruct and further trained on 200B voice and text tokens. It processes audio directly, including environmental audio information, without relying on separate semantic encoders or decoders such as Whisper and CosyVoice.

Implementation Details

The model uses a semantic-token-free architecture: audio is processed directly rather than through an intermediate semantic-token stage. The training data is multilingual, with the heaviest coverage in English and Chinese (300,000 hours each) and 20,000 hours for each of the six other languages. These per-language figures sum to 720,000 hours, so the 700,000-hour headline figure appears to be rounded.
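
The per-language hours quoted above can be tallied in a few lines. This is a minimal sketch using only the figures stated in this card; the dictionary and variable names are illustrative.

```python
# Training hours per language, as listed in this model card.
HOURS_BY_LANGUAGE = {
    "English": 300_000,
    "Chinese": 300_000,
    "German": 20_000,
    "Japanese": 20_000,
    "French": 20_000,
    "Spanish": 20_000,
    "Korean": 20_000,
    "Arabic": 20_000,
}

total_hours = sum(HOURS_BY_LANGUAGE.values())

# Fraction of the listed hours contributed by each language.
shares = {lang: hours / total_hours for lang, hours in HOURS_BY_LANGUAGE.items()}
en_zh_share = shares["English"] + shares["Chinese"]

print(f"Total listed hours: {total_hours:,}")
print(f"English + Chinese share: {en_zh_share:.1%}")
```

English and Chinese together account for five sixths of the listed hours, so results in the other six languages rest on a much smaller share of the data.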

Continued pretraining of Qwen-2.5-3B-Instruct
Trained on 200B voice & text tokens
Supports both audio-to-audio and text-to-speech capabilities

Core Capabilities

Voice-to-Voice conversion with environmental audio preservation
High-quality text-to-speech generation
Multilingual support across 8 major languages
Direct audio processing without semantic token intermediaries

Frequently Asked Questions

Q: What makes this model unique?

Its semantic-token-free architecture and its ability to handle environmental audio information set it apart from traditional voice models and allow more direct, efficient audio processing.

Q: What are the recommended use cases?

The model is suited to voice conversion, text-to-speech applications, and multilingual audio processing. Its CC-BY-NC-SA-4.0 license limits it to non-commercial use.