JETS TTS Model for LJSpeech

Property	Value
Author	imdanboy
Framework	ESPnet2
Dataset	LJSpeech
Model Type	Text-to-Speech (TTS)
Repository	Hugging Face

What is ljspeech_tts_train_jets_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave?

This is an English text-to-speech model built with the ESPnet2 framework using the JETS architecture, which jointly trains a FastSpeech2-style acoustic model and a HiFi-GAN vocoder end to end. As the model name indicates, input text is cleaned with the Tacotron text cleaner and converted to phonemes by the g2p_en_no_space grapheme-to-phoneme (G2P) module, so the model operates on phoneme tokens rather than raw characters.

Implementation Details

The generator uses a Transformer-based encoder and decoder together with dedicated duration, pitch, and energy predictors. The encoder and decoder each have 4 layers with 1024 units per layer, and multi-head attention uses 2 heads.

Generator with an attention dimension of 256
Duration predictor with 2 layers and 256 channels
Pitch predictor with 5 layers and 256 channels
Energy predictor with 2 layers and a kernel size of 3
HiFi-GAN-style multi-scale, multi-period discriminator
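
As a quick sanity check on these numbers, the sketch below collects the hyperparameters stated in this section into a small Python dataclass and derives the per-head attention size. The class and field names are hypothetical and chosen for readability; they are not ESPnet2 configuration keys. Reading the "1024 units" as the feed-forward width and the energy predictor kernel as a 1-D kernel of size 3 are both assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JetsGeneratorSummary:
    """Hyperparameters quoted in this section (field names are illustrative)."""

    attention_dim: int = 256
    attention_heads: int = 2
    encoder_layers: int = 4
    decoder_layers: int = 4
    # Assumption: "1024 units" refers to the feed-forward width of each layer.
    feed_forward_units: int = 1024
    duration_predictor_layers: int = 2
    duration_predictor_channels: int = 256
    pitch_predictor_layers: int = 5
    pitch_predictor_channels: int = 256
    energy_predictor_layers: int = 2
    # Assumption: the predictor uses 1-D convolutions, so the kernel size is 3.
    energy_predictor_kernel_size: int = 3

    def head_dim(self) -> int:
        """Size of each attention head; the dimension must split evenly."""
        if self.attention_dim % self.attention_heads != 0:
            raise ValueError("attention_dim must be divisible by attention_heads")
        return self.attention_dim // self.attention_heads


summary = JetsGeneratorSummary()
print(summary.head_dim())  # 256 / 2 = 128
```

With an attention dimension of 256 split over 2 heads, each head works on 128 dimensions.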

Core Capabilities

Speech synthesis at a 22050 Hz sampling rate
Hop length of 256 samples with a 1024-point FFT
F0 (pitch) extraction range of 80-400 Hz
Vocabulary of 78 phoneme tokens
Global mean-variance normalization of acoustic features
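
To make the signal-processing figures concrete, the following sketch derives the frame rate and window durations implied by the sampling rate, hop length, and FFT size listed above. It also shows the general idea of global mean-variance normalization on toy data. The function name and the sample values are illustrative assumptions, and the normalization shown is the generic technique, not ESPnet2's own implementation.

```python
import statistics

SAMPLE_RATE_HZ = 22050
HOP_LENGTH = 256   # samples between consecutive frames
N_FFT = 1024       # samples per analysis window

# Number of feature frames produced per second of audio.
frames_per_second = SAMPLE_RATE_HZ / HOP_LENGTH
# Time step between frames and analysis window length, in milliseconds.
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE_HZ
window_ms = 1000 * N_FFT / SAMPLE_RATE_HZ


def mean_variance_normalize(values, mean, std):
    """Shift and scale values using statistics computed once over a dataset."""
    return [(v - mean) / std for v in values]


# Toy stand-in for one feature dimension across a training set.
toy_features = [1.0, 2.0, 3.0, 4.0]
global_mean = statistics.fmean(toy_features)
global_std = statistics.pstdev(toy_features)
normalized = mean_variance_normalize(toy_features, global_mean, global_std)

print(f"{frames_per_second:.2f} frames/s, hop {hop_ms:.2f} ms, window {window_ms:.2f} ms")
```

Each frame therefore advances by roughly 11.6 ms, and each FFT window spans roughly 46.4 ms of audio.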

Frequently Asked Questions

Q: What makes this model unique?

This model combines the JETS architecture with phoneme-level input and English G2P conversion. Because JETS trains the acoustic model and the vocoder jointly, it generates waveforms directly from text without a separate vocoder, and its pitch and energy predictors model prosody explicitly.

Q: What are the recommended use cases?

The model is suited to applications that require high-quality English speech synthesis, particularly when phoneme-level control is needed. Typical examples include audiobook generation, virtual assistants, and other applications that need natural-sounding speech output.
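
For readers who want to try the model, a minimal inference sketch follows. It assumes that `espnet`, `espnet_model_zoo`, and `soundfile` are installed, and it uses ESPnet2's `Text2Speech.from_pretrained` loader. The `synthesize` helper and its parameters are hypothetical, and the Hugging Face repository id is not given here, so it is left as an argument the caller must supply.

```python
def synthesize(text: str, model_tag: str, out_path: str = "out.wav") -> str:
    """Synthesize `text` to a WAV file and return the file path.

    `model_tag` is the pretrained model identifier (for example a Hugging Face
    repository id); it is not specified in this document and must be supplied.
    """
    # Imported lazily so this module loads even without ESPnet installed.
    import soundfile as sf
    from espnet2.bin.tts_inference import Text2Speech

    tts = Text2Speech.from_pretrained(model_tag=model_tag)
    output = tts(text)    # dict of outputs; the waveform is under "wav"
    wav = output["wav"]   # 1-D torch tensor of audio samples
    # tts.fs is the model's sampling rate (22050 Hz for this model).
    sf.write(out_path, wav.view(-1).cpu().numpy(), tts.fs, subtype="PCM_16")
    return out_path


# Example call (requires the packages above and network access):
# synthesize("Hello from JETS.", model_tag="<hugging-face-repo-id>")
```

Because JETS produces the waveform directly, no vocoder needs to be loaded alongside the model in this sketch.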
