ljspeech_tts_train_jets_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave

imdanboy

ESPnet2 TTS model using the JETS architecture, trained on the LJSpeech dataset, with phoneme-level input produced by Tacotron-style text cleaning and English G2P conversion.

Property      Value
Author        imdanboy
Framework     ESPnet2
Dataset       LJSpeech
Model Type    Text-to-Speech (TTS)
Repository    Hugging Face

What is ljspeech_tts_train_jets_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave?

This is a text-to-speech model built on the JETS architecture (Jointly Training FastSpeech2 and HiFi-GAN for End-to-End Text to Speech) within the ESPnet2 framework. The model takes phoneme-level input: as the model name indicates, English text is normalized with a Tacotron-style text cleaner and converted to phonemes by a grapheme-to-phoneme (G2P) step before synthesis.
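The "no_space" part of the model name suggests that word-boundary tokens are dropped from the G2P output, so the model sees one continuous phoneme sequence. The sketch below illustrates that idea only: the function name `drop_word_separators` and the hand-written phoneme list are hypothetical, and the assumption that a single space token marks word boundaries is not confirmed by this page.

```python
from typing import List


def drop_word_separators(phonemes: List[str]) -> List[str]:
    """Return the phoneme sequence without the " " tokens marking word boundaries."""
    return [p for p in phonemes if p != " "]


# Hand-written ARPAbet-style phonemes for "hello world".
# Illustrative input only, not actual G2P output from this model.
example = ["HH", "AH0", "L", "OW1", " ", "W", "ER1", "L", "D"]

tokens = drop_word_separators(example)
print(tokens)
```

With the separators removed, word boundaries are no longer visible to the model, so pauses and phrasing have to be inferred from the phoneme context alone.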

Implementation Details

The model combines a transformer-based encoder-decoder with dedicated pitch and energy predictors. The encoder and decoder each have 4 layers with 1024 units, and use multi-head attention with 2 heads.

  • Generator with an attention dimension of 256
  • Duration predictor with 2 layers and 256 channels
  • Pitch predictor with 5 layers and 256 channels
  • Energy predictor with 2 layers and a kernel size of 3
  • HiFiGAN-style multi-scale multi-period discriminator
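To make the numbers above easier to scan, they can be collected in one place and used to derive the per-head attention width. This is a rough sketch: the dictionary keys are hypothetical names chosen for illustration (they are not ESPnet2 configuration keys), and reading "1024 units" as the feed-forward width is an assumption.

```python
# Hyperparameters quoted in this section. Key names are illustrative only.
GENERATOR_SPEC = {
    "encoder_layers": 4,
    "decoder_layers": 4,
    "units_per_layer": 1024,  # assumed to be the feed-forward width
    "attention_dim": 256,
    "attention_heads": 2,
    "duration_predictor": {"layers": 2, "channels": 256},
    "pitch_predictor": {"layers": 5, "channels": 256},
    "energy_predictor": {"layers": 2},
}


def per_head_dim(attention_dim: int, heads: int) -> int:
    """Width of each attention head when the model dimension is split evenly."""
    if heads <= 0 or attention_dim % heads != 0:
        raise ValueError("attention_dim must be a positive multiple of heads")
    return attention_dim // heads


head_dim = per_head_dim(
    GENERATOR_SPEC["attention_dim"], GENERATOR_SPEC["attention_heads"]
)
print(f"Each of the {GENERATOR_SPEC['attention_heads']} heads is {head_dim} wide")
```

With an attention dimension of 256 and 2 heads, each head works on a 128-dimensional slice.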

Core Capabilities

  • Speech synthesis at a 22,050 Hz sampling rate
  • Hop length of 256 samples and a 1024-point FFT
  • Pitch (F0) modeling over a range of 80-400 Hz
  • Vocabulary of 78 phoneme tokens
  • Global mean-variance normalization for consistent feature scaling
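The sampling rate, hop length, and FFT size above determine the time and frequency resolution of the acoustic frames. The following sketch works through that arithmetic. The constant and function names are chosen here for illustration, and the assumption that each frame corresponds to exactly one hop of output audio is a simplification that ignores padding at the edges.

```python
SAMPLE_RATE_HZ = 22050
HOP_LENGTH = 256   # samples between consecutive frames
N_FFT = 1024       # samples per analysis window


def frame_shift_ms(sample_rate: int = SAMPLE_RATE_HZ, hop: int = HOP_LENGTH) -> float:
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 * hop / sample_rate


def frames_per_second(sample_rate: int = SAMPLE_RATE_HZ, hop: int = HOP_LENGTH) -> float:
    """Number of frames produced per second of audio."""
    return sample_rate / hop


def bin_spacing_hz(sample_rate: int = SAMPLE_RATE_HZ, n_fft: int = N_FFT) -> float:
    """Frequency distance between adjacent FFT bins, in Hz."""
    return sample_rate / n_fft


def duration_seconds(num_frames: int, sample_rate: int = SAMPLE_RATE_HZ,
                     hop: int = HOP_LENGTH) -> float:
    """Approximate audio duration for a frame count (one hop per frame assumed)."""
    return num_frames * hop / sample_rate


print(f"frame shift: {frame_shift_ms():.2f} ms")     # about 11.61 ms
print(f"frame rate:  {frames_per_second():.2f} fps")  # about 86.13 frames per second
print(f"bin spacing: {bin_spacing_hz():.2f} Hz")      # about 21.53 Hz
print(f"500 frames:  {duration_seconds(500):.2f} s")
```

At these settings, one second of speech corresponds to roughly 86 frames, and the 80-400 Hz F0 range spans about 15 FFT bins.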

Frequently Asked Questions

Q: What makes this model unique?

This model combines the JETS architecture with phoneme-level input and English G2P conversion. Its built-in pitch and energy predictors model prosody explicitly, which helps produce more natural-sounding speech.

Q: What are the recommended use cases?

The model suits applications that need English speech synthesis, particularly when phoneme-level control is required. Examples include audiobook generation, virtual assistants, and other applications that need natural-sounding speech output.
