xm_transformer_unity_en-hk

facebook

Speech-to-speech translation model for English to Hokkien, built on the fairseq framework with a two-pass decoder (UnitY) and trained on TED and Audiobook domain data.

Property: Value
License: cc-by-nc-4.0
Framework: Fairseq
Task Type: Speech-to-Speech Translation
Dataset: MuST-C

What is xm_transformer_unity_en-hk?

xm_transformer_unity_en-hk is a speech-to-speech translation model developed by Facebook that converts English speech into Hokkien speech. It uses a two-pass decoder called UnitY and is trained on supervised data from the TED domain and weakly supervised data from the TED and Audiobook domains.

Implementation Details

The model runs on the Fairseq framework. It requires 16000 Hz mono-channel audio as input and uses the facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS vocoder to synthesize the Hokkien output speech.

  • Two-pass decoder architecture (UnitY)
  • Integrated speech synthesis using a HiFiGAN vocoder
  • Trained on TED and Audiobook domain data
  • Direct speech-to-speech conversion in one model, without a cascaded speech recognition and text translation pipeline
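
The 16000 Hz mono input requirement above can be checked before audio is handed to the model. The following is a minimal sketch using only the Python standard library, not part of the model's own tooling: the function names (describe_wav, is_model_ready, to_mono_16k) are hypothetical, it assumes 16-bit PCM WAV input, and it resamples by plain linear interpolation with no anti-aliasing filter. A production pipeline would more likely rely on a dedicated audio library for resampling.

```python
import array
import io
import sys
import wave

TARGET_RATE = 16000  # Hz; the input rate stated for this model


def describe_wav(data: bytes) -> tuple[int, int, int]:
    """Return (sample_rate, channels, sample_width_bytes) for WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth()


def is_model_ready(data: bytes) -> bool:
    """True if the WAV is already 16000 Hz and mono."""
    rate, channels, _ = describe_wav(data)
    return rate == TARGET_RATE and channels == 1


def to_mono_16k(data: bytes) -> bytes:
    """Downmix a 16-bit PCM WAV to mono and resample it to 16000 Hz.

    Assumption: linear interpolation is good enough for a sketch; it does
    not low-pass filter, so downsampling may introduce aliasing.
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV PCM data is little-endian

    # Downmix: average the interleaved channel values of each frame.
    mono = [
        sum(samples[i:i + channels]) // channels
        for i in range(0, len(samples), channels)
    ]

    # Resample: linearly interpolate between neighbouring input samples.
    if rate != TARGET_RATE and mono:
        n_out = max(1, round(len(mono) * TARGET_RATE / rate))
        step = rate / TARGET_RATE
        last = len(mono) - 1
        resampled = []
        for j in range(n_out):
            pos = j * step
            i = min(int(pos), last)
            frac = pos - int(pos)
            nxt = mono[min(i + 1, last)]
            resampled.append(int(round(mono[i] * (1 - frac) + nxt * frac)))
        mono = resampled

    pcm = array.array("h", mono)
    if sys.byteorder == "big":
        pcm.byteswap()
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(TARGET_RATE)
        out.writeframes(pcm.tobytes())
    return buf.getvalue()
```

A caller would typically run is_model_ready on the raw bytes first and only call to_mono_16k when the check fails, so that audio already in the expected format passes through untouched.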

Core Capabilities

  • Direct English to Hokkien speech translation
  • High-quality speech synthesis using specialized vocoder
  • Processing of 16kHz mono channel audio
  • Trained on both supervised and weakly supervised data

Frequently Asked Questions

Q: What makes this model unique?

This model performs direct speech-to-speech translation from English to Hokkien, a language pair with limited training resources. It pairs the two-pass UnitY decoder with a vocoder trained for Hokkien speech output.

Q: What are the recommended use cases?

The model is best suited to English to Hokkien translation of TED-talk style content and audiobooks, the domains it was trained on. It fits scenarios that need spoken Hokkien output directly from English speech rather than from a separate cascade of recognition, translation, and synthesis systems.
