# MMS-TTS-YOR: Yoruba Text-to-Speech Model
| Property | Value |
|---|---|
| Developer | Facebook (Meta AI) |
| License | CC-BY-NC 4.0 |
| Model Type | VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) |
| Paper | Scaling Speech Technology to 1,000+ Languages (2023) |
## What is mms-tts-yor?
MMS-TTS-YOR is a text-to-speech model for the Yoruba language, developed as part of Facebook's Massively Multilingual Speech (MMS) project. It uses the VITS architecture for end-to-end speech synthesis, converting written Yoruba directly into natural-sounding speech.
## Implementation Details
The model implements a conditional variational autoencoder (VAE) with three main components: a posterior encoder, a decoder, and a conditional prior. A Transformer-based text encoder, combined with flow-based modules, predicts the acoustic features, and a HiFi-GAN-style decoder built from transposed convolutional layers generates the final waveform. Key architectural features:
- Stochastic duration predictor for varied speech rhythms
- Flow-based module with coupling layers
- End-to-end training with variational lower bound and adversarial losses
- Non-deterministic output requiring seed fixing for reproducibility
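The flow-based coupling layers mentioned above can be illustrated with a minimal affine coupling block in PyTorch. This is a toy sketch of the general technique, not the model's actual module (VITS's real coupling layers operate on spectrogram-frame sequences with WaveNet-style conditioning):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Toy affine coupling layer: transforms half the features
    conditioned on the other half, and is exactly invertible."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        # Small net maps the untouched half to a log-scale and a shift
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),  # -> (log_s, t), each dim // 2 wide
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        yb = xb * torch.exp(log_s) + t          # affine transform of xb
        log_det = log_s.sum(dim=-1)             # log|det Jacobian|, used in the flow loss
        return torch.cat([xa, yb], dim=-1), log_det

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=-1)
        log_s, t = self.net(ya).chunk(2, dim=-1)
        xb = (yb - t) * torch.exp(-log_s)       # undo the affine transform
        return torch.cat([ya, xb], dim=-1)

layer = AffineCoupling(8)
x = torch.randn(4, 8)
y, log_det = layer(x)
x_rec = layer.inverse(y)  # recovers x up to floating-point error
```

Because each coupling layer is invertible with a cheap log-determinant, stacks of them can map between a simple prior and complex acoustic distributions in both directions.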
## Core Capabilities
- Direct text-to-speech synthesis for Yoruba language
- Variable speech rhythm generation
- High-quality spectrogram-based acoustic feature prediction
- Integration with the 🤗 Transformers library (v4.33 and later)
## Frequently Asked Questions
### Q: What makes this model unique?
This model is specifically trained for Yoruba language speech synthesis and uses a sophisticated VITS architecture that allows for natural variation in speech patterns through its stochastic duration predictor.
### Q: What are the recommended use cases?
The model is ideal for applications requiring Yoruba language text-to-speech conversion, such as accessibility tools, educational software, or automated voice systems. It's particularly useful when natural-sounding speech variation is desired.