ljspeech_tts_train_jets_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave

Maintained By
imdanboy

JETS TTS Model for LJSpeech

Property     Value
Author       imdanboy
Framework    ESPnet2
Dataset      LJSpeech
Model Type   Text-to-Speech (TTS)
Repository   Hugging Face

What is ljspeech_tts_train_jets_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave?

This is a text-to-speech model built on the JETS architecture (Jointly Training FastSpeech2 and HiFi-GAN for End-to-End Text-to-Speech) within the ESPnet2 framework. As the model name indicates, it takes phoneme-level (phn) input, with English text normalized by the Tacotron text cleaner and converted to phonemes by the g2p_en_no_space grapheme-to-phoneme (G2P) module.

Implementation Details

The model combines a Transformer-based encoder-decoder structure with dedicated duration, pitch, and energy predictors. The encoder and decoder each have 4 layers with 1024 units, and multi-head attention uses 2 heads.

  • Generator with an attention dimension of 256
  • Duration predictor with 2 layers and 256 channels
  • Pitch predictor with 5 layers and 256 channels
  • Energy predictor with 2 layers and a kernel size of 3
  • HiFiGAN-style multi-scale multi-period discriminator
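The hyperparameters listed above can be collected in one place to check that they fit together. The following is a minimal sketch, not the model's actual configuration file: the class and field names are hypothetical, the 1024 "units" are recorded as-is without assuming what they count, and the energy predictor kernel is assumed to be a 1-D width of 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JetsGeneratorSketch:
    """Illustrative container for the values quoted in this model card.

    Field names are hypothetical and chosen for readability only.
    """

    attention_dim: int = 256
    attention_heads: int = 2
    encoder_layers: int = 4
    encoder_units: int = 1024
    decoder_layers: int = 4
    decoder_units: int = 1024
    duration_predictor_layers: int = 2
    duration_predictor_channels: int = 256
    pitch_predictor_layers: int = 5
    pitch_predictor_channels: int = 256
    energy_predictor_layers: int = 2
    energy_predictor_kernel_size: int = 3  # assumption: 1-D kernel width

    def head_dim(self) -> int:
        """Per-head width in multi-head attention (attention_dim / heads)."""
        if self.attention_dim % self.attention_heads != 0:
            raise ValueError("attention_dim must be divisible by attention_heads")
        return self.attention_dim // self.attention_heads


config = JetsGeneratorSketch()
print(f"per-head dimension: {config.head_dim()}")
```

With an attention dimension of 256 split across 2 heads, each head works on a 128-dimensional slice.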

Core Capabilities

  • Speech synthesis at a 22050 Hz sampling rate
  • Hop length of 256 samples with a 1024-point FFT
  • Pitch (F0) modeling over an 80-400 Hz range
  • Token-level processing with 78 unique phoneme tokens
  • Global mean-variance normalization of features for consistent output
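The sampling rate, hop length, and FFT size above determine the time resolution of the model's acoustic frames. The short sketch below works through that arithmetic; the function names are illustrative, and the frame-count formula assumes a centered STFT (one frame per hop plus one), which may differ from the padding convention actually used in training.

```python
SAMPLE_RATE_HZ = 22050
HOP_LENGTH = 256  # samples between consecutive frames
N_FFT = 1024      # samples per analysis window


def frame_shift_ms(sample_rate: int = SAMPLE_RATE_HZ, hop: int = HOP_LENGTH) -> float:
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 * hop / sample_rate


def window_ms(sample_rate: int = SAMPLE_RATE_HZ, n_fft: int = N_FFT) -> float:
    """Duration covered by one FFT window, in milliseconds."""
    return 1000.0 * n_fft / sample_rate


def num_frames(duration_s: float, sample_rate: int = SAMPLE_RATE_HZ,
               hop: int = HOP_LENGTH) -> int:
    """Frame count for a clip, assuming a centered STFT (an assumption)."""
    n_samples = int(round(duration_s * sample_rate))
    return 1 + n_samples // hop


print(f"frame shift: {frame_shift_ms():.2f} ms")
print(f"window:      {window_ms():.2f} ms")
print(f"frames in 1 s of audio: {num_frames(1.0)}")
```

Each frame therefore advances by roughly 11.6 ms and spans about 46.4 ms of audio, so the model produces on the order of 86 frames per second of speech.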

Frequently Asked Questions

Q: What makes this model unique?

This model pairs the JETS architecture, which trains the acoustic model and the vocoder jointly, with phoneme-level input and G2P conversion for English. Its dedicated pitch and energy predictors help produce more natural-sounding speech.

Q: What are the recommended use cases?

The model is ideal for applications requiring high-quality English speech synthesis, particularly when phone-level control is needed. It's well-suited for audiobook generation, virtual assistants, and other applications requiring natural-sounding speech output.
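For these use cases, inference would typically go through ESPnet2's `Text2Speech` interface. The following is a minimal sketch under several assumptions: that `espnet`, `espnet_model_zoo`, and `soundfile` are installed, and that the model is published on Hugging Face under the author and model name given on this page (the `MODEL_TAG` value is built from those two strings and has not been verified here). The `synthesize` helper is hypothetical.

```python
# Assumption: the Hugging Face repository id is "<author>/<model name>".
MODEL_TAG = (
    "imdanboy/"
    "ljspeech_tts_train_jets_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave"
)


def synthesize(text: str, out_path: str = "out.wav") -> str:
    """Synthesize `text` to a WAV file and return the output path.

    Heavy dependencies are imported lazily so that this module can be
    loaded even when ESPnet is not installed.
    """
    import soundfile as sf
    from espnet2.bin.tts_inference import Text2Speech

    # Downloads the model on first use (requires espnet_model_zoo).
    tts = Text2Speech.from_pretrained(MODEL_TAG)

    # JETS generates the waveform directly; no separate vocoder is passed.
    wav = tts(text)["wav"]

    # tts.fs holds the model's sampling rate (22050 Hz per this page).
    sf.write(out_path, wav.view(-1).cpu().numpy(), tts.fs, subtype="PCM_16")
    return out_path
```

Calling `synthesize("Hello from JETS.")` would write `out.wav` in the current directory, provided the model tag resolves and the dependencies are available.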
