vocos-mel-hifigan-compat-44100khz

Maintained By
patriotyk


Property       Value
Author         patriotyk
Training Data  800+ hours Ukrainian audiobooks
Paper          Vocos: Closing the gap between time-domain and Fourier-based neural vocoders
Sample Rate    44.1kHz

What is vocos-mel-hifigan-compat-44100khz?

vocos-mel-hifigan-compat-44100khz is a Vocos neural vocoder that synthesizes audio waveforms from mel spectrograms. Unlike time-domain GAN vocoders such as HiFi-GAN, which generate waveform samples directly through upsampling layers, it predicts Fourier spectral coefficients and reconstructs the waveform with an inverse Fourier transform, which makes synthesis faster. The model takes 80-bin mel spectrograms as input, making it compatible with many existing TTS systems.
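The idea of reconstructing audio from spectral coefficients can be illustrated with a small pure-Python sketch. It is not the model's code: the frame size, hop, and helper names (`stft`, `istft`) are hypothetical, and the magnitudes and phases are taken from a true STFT instead of being predicted by a network. It only shows that once magnitude and phase are available for every frame, an inverse transform plus overlap-add yields the waveform.

```python
import cmath
import math

def hann(n_fft):
    # Periodic Hann window; with hop = n_fft // 2, shifted copies sum to 1.
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]

def dft(frame):
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(coeffs):
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def stft(signal, n_fft, hop):
    # Windowed analysis: one list of complex coefficients per frame.
    win = hann(n_fft)
    return [dft([signal[start + i] * win[i] for i in range(n_fft)])
            for start in range(0, len(signal) - n_fft + 1, hop)]

def istft(frames, n_fft, hop):
    # Inverse-transform every frame and overlap-add the results.
    out = [0.0] * (hop * (len(frames) - 1) + n_fft)
    for idx, coeffs in enumerate(frames):
        for i, sample in enumerate(idft(coeffs)):
            out[idx * hop + i] += sample
    return out

# Toy signal (illustrative sizes, far smaller than real audio frames).
N_FFT, HOP = 8, 4
signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
frames = stft(signal, N_FFT, HOP)

# Stand-in for the network output: per-frame magnitude and phase.
mags = [[abs(c) for c in f] for f in frames]
phases = [[cmath.phase(c) for c in f] for f in frames]

# Combine magnitude and phase into complex coefficients, then invert.
coeffs = [[cmath.rect(m, p) for m, p in zip(ms, ps)] for ms, ps in zip(mags, phases)]
audio = istft(coeffs, N_FFT, HOP)
```

Samples covered by two overlapping frames are recovered up to floating-point error; the first and last `HOP` samples are attenuated by the window because only one frame covers them.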

Implementation Details

The model was trained for 2.0M steps (210 epochs) with a batch size of 20, on two RTX 3090 GPUs over approximately one month. Training used a cosine learning-rate scheduler with an initial learning rate of 3e-4. The architecture generates spectral coefficients rather than synthesizing the waveform directly in the time domain.
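To make the schedule concrete, here is a minimal sketch of plain cosine annealing using the step count and initial learning rate quoted above. The minimum learning rate of 0 and the absence of warmup are assumptions, since the source does not state them, and `cosine_lr` is a hypothetical helper rather than the training code. The last two lines work out the per-epoch figures implied by the stated numbers.

```python
import math

TOTAL_STEPS = 2_000_000  # from the text: 2.0M steps
EPOCHS = 210             # from the text
BATCH_SIZE = 20          # from the text
INITIAL_LR = 3e-4        # from the text
MIN_LR = 0.0             # assumption: not stated in the source

def cosine_lr(step):
    """Learning rate at `step` under plain cosine annealing (no warmup assumed)."""
    progress = min(max(step / TOTAL_STEPS, 0.0), 1.0)
    return MIN_LR + 0.5 * (INITIAL_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

# Figures implied by the stated totals.
steps_per_epoch = TOTAL_STEPS / EPOCHS            # roughly 9,524 steps
samples_per_epoch = steps_per_epoch * BATCH_SIZE  # roughly 190,000 if 20 is the global batch size
```

Under these assumptions the rate starts at 3e-4, reaches half that value at the midpoint of training, and decays to 0 at step 2.0M.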

  • Mel spectrogram input: 80 bins
  • Sampling rate: 44.1kHz
  • Reported metrics: PESQ 3.399, UTMOS 3.146
  • Optimized mel loss coefficient: 45
  • MRD loss coefficient: 1.0
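The two loss coefficients listed above act as weights in the generator objective. The sketch below shows one way such weights are typically combined, modeled on the loss structure of the reference Vocos training recipe as I understand it. The multi-period discriminator (MPD) terms, the feature-matching terms, and the function name `generator_loss` are assumptions not stated in this document.

```python
MEL_LOSS_COEFF = 45.0  # from the text
MRD_LOSS_COEFF = 1.0   # from the text

def generator_loss(mel_loss, adv_mpd, adv_mrd, fm_mpd, fm_mrd):
    """Weighted sum of generator loss terms (hypothetical sketch).

    mel_loss : mel-spectrogram reconstruction loss
    adv_*    : adversarial losses from the MPD and MRD discriminators
    fm_*     : feature-matching losses from the same discriminators
    """
    return (
        adv_mpd
        + MRD_LOSS_COEFF * adv_mrd
        + fm_mpd
        + MRD_LOSS_COEFF * fm_mrd
        + MEL_LOSS_COEFF * mel_loss
    )

# Example with made-up loss values: the mel term dominates because of its weight.
total = generator_loss(mel_loss=0.2, adv_mpd=1.0, adv_mrd=1.0, fm_mpd=0.5, fm_mrd=0.5)
```

With a coefficient of 45, even a small mel reconstruction error contributes far more to the total than the adversarial terms, which pushes the generator toward spectrally faithful output.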

Core Capabilities

  • Fast audio synthesis from mel spectrograms
  • HiFi-GAN compatibility for easy integration
  • High-quality speech synthesis
  • Efficient spectral domain processing
  • 44.1kHz high-resolution audio output

Frequently Asked Questions

Q: What makes this model unique?

This model operates in the spectral domain rather than the time domain, which makes synthesis faster while maintaining high quality. It is designed to accept mel spectrograms in the HiFi-GAN format, so it can serve as a drop-in replacement for a HiFi-GAN vocoder in existing TTS pipelines.
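As a rough illustration of the drop-in usage, the sketch below wraps the `vocos` Python package. The repository ID is an assumption built from the author and model name shown on this page, and `check_mel_shape` and `synthesize` are hypothetical helpers. The `vocos` import is deferred so the shape check can be used without the package installed.

```python
N_MELS = 80           # from the text: 80-bin mel spectrograms
SAMPLE_RATE = 44100   # from the text: 44.1kHz output
# Assumption: Hugging Face repo ID formed from the author and model name.
REPO_ID = "patriotyk/vocos-mel-hifigan-compat-44100khz"

def check_mel_shape(shape):
    """Validate a (batch, n_mels, frames) shape and return the frame count."""
    if len(shape) != 3 or shape[1] != N_MELS:
        raise ValueError(f"expected (batch, {N_MELS}, frames), got {tuple(shape)}")
    return shape[2]

def synthesize(mel):
    """Decode a HiFi-GAN-style mel tensor of shape (batch, 80, frames) to audio."""
    from vocos import Vocos  # requires `pip install vocos`; downloads weights on first use

    check_mel_shape(tuple(mel.shape))
    vocoder = Vocos.from_pretrained(REPO_ID)
    return vocoder.decode(mel)  # waveform tensor of shape (batch, samples)
```

In a TTS pipeline, `mel` would be the acoustic model's output that was previously passed to a HiFi-GAN generator, and the returned tensor can be written out at `SAMPLE_RATE`.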

Q: What are the recommended use cases?

The model is primarily designed for speech synthesis applications, particularly in text-to-speech systems that output mel spectrograms. While it excels at speech synthesis, it may not produce optimal results for other audio domains.
