# XEUS: Cross-lingual Encoder for Universal Speech
| Property | Value |
|---|---|
| Parameters | 577M |
| License | CC-BY-NC-SA-4.0 |
| Architecture | E-Branchformer |
| Paper | Link to Paper |
## What is XEUS?
XEUS is a multilingual speech encoder developed by Carnegie Mellon University's WAVLab that covers over 4,000 languages, a significant step toward universal speech processing. It was trained on more than 1 million hours of publicly available speech data using HuBERT-style masked prediction, and it employs the E-Branchformer architecture.
## Implementation Details
The model incorporates several innovative technical features:
- Trained using masked prediction of discrete speech tokens from WavLabLM
- Implements acoustic noise and reverberation augmentation for enhanced robustness
- Supports Flash Attention for improved performance
- Offers customizable masking settings for fine-tuning (see the loading sketch after this list)
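As a rough usage sketch: XEUS checkpoints are released for use with ESPnet, and feature extraction can look roughly like the following. This is a hedged example rather than official usage: the checkpoint and audio paths are placeholders, and the exact entry points (`SSLTask.build_model_from_file`, `encode`, and its `use_mask`/`use_final_output` flags) may vary across ESPnet versions.

```python
import torch
import soundfile as sf
from espnet2.tasks.ssl import SSLTask  # ESPnet's self-supervised learning task

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder path: the XEUS checkpoint must be downloaded separately.
xeus_model, xeus_train_args = SSLTask.build_model_from_file(
    None,                            # config is resolved from the checkpoint
    "/path/to/xeus_checkpoint.pth",  # placeholder
    device,
)

# XEUS expects 16 kHz mono audio.
wav, sr = sf.read("/path/to/audio.wav")  # placeholder
wavs = torch.tensor(wav, dtype=torch.float32).unsqueeze(0).to(device)
wav_lengths = torch.LongTensor([wavs.shape[1]]).to(device)

# use_mask=True applies the model's masking settings (useful during
# fine-tuning); keep it False for plain feature extraction.
feats = xeus_model.encode(
    wavs, wav_lengths, use_mask=False, use_final_output=False
)[0][-1]
print(feats.shape)  # (batch, frames, hidden_dim): last-layer representations
```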
## Core Capabilities
- State-of-the-art performance on the ML-SUPERB multilingual speech recognition benchmark
- Outperforms models such as MMS, w2v-BERT 2.0, and XLS-R
- Sets new state-of-the-art results on 4 tasks of the monolingual SUPERB benchmark
- Provides robust speech representations across thousands of languages
## Frequently Asked Questions
### Q: What makes this model unique?
XEUS stands out for its unprecedented language coverage (4000+ languages) and its robust performance across multiple speech processing tasks. The model's architecture and training approach, combining E-Branchformer with acoustic augmentation, make it particularly effective for universal speech processing.
### Q: What are the recommended use cases?
The model is primarily intended as a foundation for speech recognition and translation, and requires fine-tuning for specific applications. It can also serve as a semantic speech tokenizer by applying k-means clustering to its hidden states, as sketched below. It is particularly valuable for multilingual applications and for research on low-resource languages.
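As a minimal sketch of the k-means tokenization idea (the hidden states here are stand-in random tensors, and the hidden size and cluster count are illustrative assumptions, not values from the XEUS paper):

```python
import torch
from sklearn.cluster import KMeans

# feats: (batch, seq_len, hidden_dim) hidden states from a chosen XEUS layer.
# Random tensors stand in for real features; 1024 is an assumed hidden size.
feats = torch.randn(8, 200, 1024)

# Flatten frames from all utterances into one pool of vectors.
frames = feats.reshape(-1, feats.shape[-1]).numpy()

# Fit k-means; 500 clusters is an illustrative choice, not a recommendation.
kmeans = KMeans(n_clusters=500, n_init=10, random_state=0).fit(frames)

# Each frame's cluster ID serves as a discrete "semantic" speech token.
tokens = kmeans.predict(feats[0].numpy())  # token sequence for one utterance
print(tokens[:20])
```

In practice, the centroids would be fit once on features extracted from a large corpus and then reused to tokenize new audio.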