xm_transformer_unity_en-hk

facebook

English-to-Hokkien speech-to-speech translation model, built on the fairseq framework with a two-pass decoder (UnitY) and trained on TED and audiobook domain data.

Property	Value
License	cc-by-nc-4.0
Framework	Fairseq
Task Type	Speech-to-Speech Translation
Dataset	MuST-C

What is xm_transformer_unity_en-hk?

xm_transformer_unity_en-hk is a speech-to-speech translation model from Facebook that converts English speech into Hokkien speech. It uses a two-pass decoder called UnitY and was trained on supervised data from the TED domain plus weakly supervised data from the TED and audiobook domains.

Implementation Details

The translation model is paired with a separate vocoder, facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS, which synthesizes the Hokkien waveform. Input audio must be 16000 Hz, mono channel. The implementation is built on the fairseq framework.

Two-pass decoder architecture (UnitY)
Speech synthesis through a separate unit HiFi-GAN vocoder
Coverage of the TED and audiobook domains
Direct speech-to-speech translation in a single model rather than a cascade of separate recognition, translation, and synthesis systems
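
The 16000 Hz mono input requirement can be checked before a file is passed to the model. The following is a minimal sketch that uses only Python's standard-library wave module. The function name check_wav_format and the two constants are illustrative assumptions, not part of fairseq or the model's own tooling, and the sketch handles WAV files only.

```python
import wave

EXPECTED_RATE = 16000    # Hz, the input rate stated for this model
EXPECTED_CHANNELS = 1    # mono


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

Returning a list of problems instead of raising on the first mismatch lets a caller report both a wrong sample rate and a wrong channel count in one pass.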

Core Capabilities

Direct English to Hokkien speech translation
Speech synthesis with a dedicated Hokkien vocoder
Processing of 16kHz mono channel audio
Training on both supervised and weakly supervised data
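
Audio recorded in another format has to be converted to 16 kHz mono first. The sketch below shows the two steps involved, downmixing and resampling, in plain Python. Both function names are hypothetical, and the resampler is a naive linear interpolation with no anti-aliasing filter, so it illustrates the idea rather than replacing a proper resampler from a dedicated audio library.

```python
def downmix_to_mono(interleaved, channels):
    """Average each frame of interleaved samples into a single mono sample."""
    return [
        sum(interleaved[i:i + channels]) / channels
        for i in range(0, len(interleaved) - channels + 1, channels)
    ]


def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample by linear interpolation (naive: no low-pass filtering)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for j in range(n_out):
        pos = j * src_rate / dst_rate        # fractional index into the source
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]  # clamp at the last sample
        out.append(samples[i] * (1 - frac) + nxt * frac)
    return out
```

For example, a stereo recording at 32000 Hz would be passed through downmix_to_mono(samples, 2) and then resample_linear(mono, 32000) before being written out as a 16 kHz mono file.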

Frequently Asked Questions

Q: What makes this model unique?

This model performs direct speech-to-speech translation from English to Hokkien, a language pair with limited training resources. It combines the two-pass UnitY decoder with a vocoder built specifically for Hokkien speech.

Q: What are the recommended use cases?

The model is best suited to English-to-Hokkien translation of TED-talk-style speech and audiobook narration, the domains it was trained on. It fits scenarios where the output is needed as Hokkien speech rather than as translated text.