vocos-mel-hifigan-compat-44100khz

patriotyk

A fast neural vocoder that synthesizes high-quality speech from HiFi-GAN-compatible mel spectrograms, trained on 800+ hours of Ukrainian audiobooks at 44.1 kHz.

Property	Value
Author	patriotyk
Training Data	800+ hours Ukrainian audiobooks
Paper	Vocos: Closing the gap between time-domain and Fourier-based neural vocoders
Sample Rate	44.1kHz

What is vocos-mel-hifigan-compat-44100khz?

This is a neural vocoder that synthesizes high-quality audio waveforms from mel spectrograms. Unlike typical GAN-based vocoders such as HiFi-GAN, which generate waveform samples directly in the time domain, it predicts Fourier spectral coefficients and reconstructs the waveform with an inverse Fourier transform, which makes audio reconstruction faster. The model takes 80-bin mel spectrograms as input, making it compatible with many existing TTS systems.
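As a rough illustration of the inverse-Fourier reconstruction step described above, the sketch below turns per-frame log-magnitude and phase predictions into a waveform using NumPy. The function name `istft_head`, the FFT size of 1024, the hop of 256, and the magnitude clip are assumptions chosen for illustration; they are not taken from this model's actual configuration.

```python
import numpy as np

def istft_head(log_mag, phase, n_fft=1024, hop=256):
    """Hypothetical sketch: rebuild a waveform from predicted spectral coefficients.

    log_mag, phase: arrays of shape (frames, n_fft // 2 + 1).
    n_fft and hop are illustrative values, not this model's real settings.
    """
    # Magnitude from log-magnitude, clipped as a safeguard against overflow.
    mag = np.minimum(np.exp(log_mag), 1e2)
    # Combine magnitude and phase into complex Fourier coefficients.
    spec = mag * np.exp(1j * phase)
    # Inverse real FFT of every frame: (frames, n_fft) time-domain segments.
    frames = np.fft.irfft(spec, n=n_fft, axis=-1)

    window = np.hanning(n_fft)
    n_frames = frames.shape[0]
    out_len = n_fft + hop * (n_frames - 1)
    audio = np.zeros(out_len)
    envelope = np.zeros(out_len)
    # Windowed overlap-add of the frames, tracking the summed squared window.
    for i in range(n_frames):
        start = i * hop
        audio[start:start + n_fft] += frames[i] * window
        envelope[start:start + n_fft] += window ** 2

    # Trim the edges so the output has exactly n_frames * hop samples.
    pad = (n_fft - hop) // 2
    audio = audio[pad:out_len - pad]
    envelope = envelope[pad:out_len - pad]
    return audio / np.maximum(envelope, 1e-8)

# Random coefficients stand in for the network's output.
rng = np.random.default_rng(0)
log_mag = rng.normal(size=(10, 513))
phase = rng.uniform(-np.pi, np.pi, size=(10, 513))
audio = istft_head(log_mag, phase)
print(audio.shape)  # (2560,): 10 frames * hop of 256
```

Because each output frame is produced by a single inverse FFT instead of a stack of learned upsampling layers, the network itself can run at the frame rate of the mel spectrogram, which is where the speed advantage comes from.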

Implementation Details

The model was trained for 2.0M steps (210 epochs) with a batch size of 20 on two RTX 3090 GPUs, which took approximately one month. Training used a cosine learning-rate scheduler with an initial learning rate of 3e-4. The architecture generates spectral coefficients rather than synthesizing the waveform directly in the time domain.
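To make the schedule concrete, here is a minimal sketch of a cosine learning-rate decay using the step count and initial rate quoted above. The function name `cosine_lr`, the absence of warmup, the decay over the full 2.0M steps, and the final rate of zero are assumptions; the source does not state them.

```python
import math

def cosine_lr(step, total_steps=2_000_000, base_lr=3e-4, min_lr=0.0):
    """Hypothetical cosine decay from base_lr to min_lr over total_steps.

    Assumes no warmup and min_lr = 0; neither is confirmed by the source.
    """
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Half-cosine curve: starts at base_lr and ends at min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 500_000, 1_000_000, 2_000_000):
    print(step, cosine_lr(step))
```

Under these assumptions the rate is 3e-4 at the first step, half of that at the midpoint of training, and zero at the final step.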

Mel spectrogram input: 80 bins
Sampling rate: 44.1kHz
Training metrics achieved: PESQ score of 3.399, UTMOS score of 3.146
Optimized mel loss coefficient: 45
MRD loss coefficient: 1.0
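
The two coefficients above weight terms of the generator objective. The sketch below shows one plausible way they combine, assuming the objective follows the upstream Vocos training recipe (adversarial and feature-matching terms from a multi-period and a multi-resolution discriminator, plus a mel reconstruction term). The function and argument names are hypothetical, and the exact formulation used for this checkpoint is not confirmed by the source.

```python
def generator_loss(mel_loss, adv_mpd, adv_mrd, fm_mpd, fm_mrd,
                   mel_coeff=45.0, mrd_coeff=1.0):
    """Hypothetical weighted sum of generator loss terms.

    mel_loss: mel-spectrogram reconstruction loss
    adv_*:    adversarial losses (multi-period / multi-resolution discriminator)
    fm_*:     feature-matching losses for the same two discriminators
    """
    return (
        adv_mpd
        + mrd_coeff * adv_mrd   # MRD terms scaled by the MRD coefficient (1.0)
        + fm_mpd
        + mrd_coeff * fm_mrd
        + mel_coeff * mel_loss  # mel term scaled by the mel coefficient (45)
    )

total = generator_loss(mel_loss=0.2, adv_mpd=1.0, adv_mrd=1.0, fm_mpd=0.5, fm_mrd=0.5)
```

With a coefficient of 45, the mel reconstruction term dominates the sum unless the adversarial terms are large, which keeps the output anchored to the input spectrogram.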

Core Capabilities

Fast audio synthesis from mel spectrograms
HiFi-GAN compatibility for easy integration
High-quality speech synthesis
Efficient spectral domain processing
44.1kHz high-resolution audio output

Frequently Asked Questions

Q: What makes this model unique?

This model generates Fourier spectral coefficients instead of time-domain samples, which makes synthesis faster while maintaining high quality. It is designed to accept mel spectrograms in the HiFi-GAN format, so it can serve as a drop-in replacement for a HiFi-GAN vocoder in existing TTS pipelines.

Q: What are the recommended use cases?

The model is primarily designed for speech synthesis, particularly in text-to-speech systems that output mel spectrograms. Because it was trained on speech (Ukrainian audiobooks), it may not produce optimal results for other audio domains.