xm_transformer_s2ut_en-hk
| Property | Value |
|---|---|
| License | CC-BY-NC-4.0 |
| Framework | Fairseq |
| Task Type | Speech-to-Speech Translation |
| Dataset | MuST-C |
What is xm_transformer_s2ut_en-hk?
xm_transformer_s2ut_en-hk is a speech-to-speech translation model that converts English speech directly into Hokkien speech. Released by Facebook and built on the Fairseq framework, it uses a speech-to-unit translation (S2UT) architecture with a single-pass decoder. It was trained on supervised data from the TED domain and on weakly supervised data from the TED and audiobook domains.
Implementation Details
The model does not run speech recognition and text translation as separate stages. Instead, its decoder predicts a sequence of discrete speech units from the English audio, and the facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS vocoder converts those units into a Hokkien waveform. Input audio must be 16000 Hz, mono channel.
- Direct speech-to-speech translation without intermediate text representation
- Trained on supervised TED-domain data plus weakly supervised TED and audiobook data
- Uses a unit HiFiGAN vocoder for speech synthesis
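The 16000 Hz mono requirement means most recordings need to be downmixed and resampled before they reach the model. The following is a minimal, dependency-free sketch of that preparation step, not the model's own preprocessing: `downmix_to_mono`, `resample_linear`, and `prepare_input` are hypothetical helper names, and the linear interpolation used here applies no anti-aliasing filter, so a dedicated audio library would be the better choice in practice.

```python
from typing import List, Sequence

TARGET_RATE = 16_000  # sample rate (Hz) the model expects, per the description above


def downmix_to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average equally long channels, sample by sample, into one channel."""
    if not channels:
        raise ValueError("need at least one channel")
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same length")
    return [sum(frame) / len(channels) for frame in zip(*channels)]


def resample_linear(
    samples: Sequence[float], src_rate: int, dst_rate: int = TARGET_RATE
) -> List[float]:
    """Resample by linear interpolation (illustration only, no low-pass filter)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if src_rate == dst_rate or len(samples) == 0:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def prepare_input(channels: Sequence[Sequence[float]], src_rate: int) -> List[float]:
    """Downmix to mono, then resample to TARGET_RATE."""
    return resample_linear(downmix_to_mono(channels), src_rate)
```

For example, `prepare_input(stereo_channels, 48_000)` turns one second of 48 kHz stereo audio into 16,000 mono samples.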
Core Capabilities
- Direct English to Hokkien speech translation
- High-quality voice synthesis through unit HiFiGAN integration
- Accepts 16 kHz mono audio as input
- Efficient single-pass decoding architecture
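As a rough illustration of what "unit" means in unit HiFiGAN: a speech-to-unit model emits a sequence of discrete unit ids rather than text, and the vocoder turns that sequence into audio. The sketch below assumes the ids arrive as a space-separated string and that `km2500` in the vocoder name refers to a vocabulary of 2500 units. Both are assumptions, and `parse_units` is a hypothetical helper rather than part of Fairseq.

```python
from typing import List

# Assumption: "km2500" in the vocoder name denotes 2500 discrete units.
NUM_UNITS = 2500


def parse_units(unit_string: str, num_units: int = NUM_UNITS) -> List[int]:
    """Convert a space-separated unit string to integer ids and check their range."""
    units = [int(token) for token in unit_string.split()]
    for unit in units:
        if not 0 <= unit < num_units:
            raise ValueError(f"unit id {unit} is outside [0, {num_units})")
    return units


# Hypothetical decoder output; real sequences are much longer.
example_units = parse_units("12 7 7 2499")
```

Validating the ids before synthesis is a cheap way to catch a mismatch between a translation model and a vocoder that were trained with different unit vocabularies.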
Frequently Asked Questions
Q: What makes this model unique?
This model translates English speech directly into Hokkien speech without producing intermediate text. Because it skips the separate recognition, text translation, and synthesis stages of a cascaded system, it can be more efficient and avoids passing transcription errors from one stage to the next.
Q: What are the recommended use cases?
The model is best suited to English-to-Hokkien translation of TED-talk-style speech, which matches its supervised training data. It may also be useful in educational settings and in other speech translation scenarios where spoken rather than written output is needed.