bigvgan_v2_22khz_80band_256x

nvidia

A universal neural vocoder from NVIDIA for high-quality audio generation. This variant runs at a 22 kHz sampling rate with 80 mel bands and 256x upsampling.

Property          Value
License           MIT
Paper             Research Paper
Parameters        112M
Sampling Rate     22 kHz
Mel Bands         80
Upsampling Ratio  256x

What is bigvgan_v2_22khz_80band_256x?

BigVGAN v2 is a neural vocoder developed by NVIDIA that generates high-quality audio waveforms from mel spectrograms. This variant operates at a 22 kHz sampling rate with 80 mel frequency bands and a 256x upsampling ratio, so each input mel frame expands to 256 output audio samples. It is the second iteration of the original BigVGAN architecture, trained on a large-scale compilation of diverse audio data.
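The relationship between the sampling rate, the mel frame count, and the 256x upsampling ratio can be checked with a little arithmetic. The sketch below assumes that "22 kHz" means the common 22,050 Hz rate (this page only says 22 kHz); the constant and function names are illustrative and not part of any BigVGAN API.

```python
# Assumption: "22 kHz" refers to 22,050 Hz. Adjust if your config differs.
SAMPLING_RATE_HZ = 22050
N_MELS = 80              # mel bands per frame (input channel count)
UPSAMPLING_RATIO = 256   # audio samples generated per mel frame


def output_samples(n_frames: int, ratio: int = UPSAMPLING_RATIO) -> int:
    """Number of waveform samples produced from n_frames mel frames."""
    return n_frames * ratio


def frames_per_second(sampling_rate_hz: int = SAMPLING_RATE_HZ,
                      ratio: int = UPSAMPLING_RATIO) -> float:
    """Mel frame rate implied by the sampling rate and upsampling ratio."""
    return sampling_rate_hz / ratio


def duration_seconds(n_frames: int,
                     sampling_rate_hz: int = SAMPLING_RATE_HZ,
                     ratio: int = UPSAMPLING_RATIO) -> float:
    """Length of the generated audio in seconds."""
    return output_samples(n_frames, ratio) / sampling_rate_hz


if __name__ == "__main__":
    # A mel spectrogram of shape (80, 100) yields 100 * 256 samples.
    print(output_samples(100))
    print(frames_per_second())
    print(duration_seconds(100))
```

Under these assumptions, roughly 86 mel frames describe one second of audio, which is a useful sanity check when pairing the vocoder with an acoustic model.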

Implementation Details

The model ships with an optional custom CUDA kernel for inference, with a reported 1.5-3x speedup on A100 GPUs. During training, it uses a multi-scale sub-band CQT discriminator and a multi-scale mel spectrogram loss to improve audio quality.

  • Custom CUDA kernel implementation for faster inference
  • Multi-scale sub-band CQT discriminator architecture
  • Multi-scale mel spectrogram loss function
  • Trained on diverse audio datasets including speech, environmental sounds, and instruments
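The multi-scale mel spectrogram loss listed above compares real and generated audio at several time-frequency resolutions at once. The following NumPy sketch illustrates the general idea only: the scales, the filterbank construction, and every function name are assumptions chosen for illustration, not the BigVGAN training recipe.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel(wave, sr, n_fft, hop, n_mels):
    """Log-mel spectrogram of a 1-D signal, shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    mel = magnitude @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(np.clip(mel, 1e-5, None))


# Illustrative (n_fft, hop, n_mels) triples; NOT the values used by BigVGAN.
SCALES = ((512, 128, 40), (1024, 256, 80), (2048, 512, 160))


def multi_scale_mel_loss(real, fake, sr=22050, scales=SCALES):
    """Mean L1 distance between log-mel spectrograms, averaged over scales."""
    per_scale = [
        np.mean(np.abs(log_mel(real, sr, n_fft, hop, n_mels)
                       - log_mel(fake, sr, n_fft, hop, n_mels)))
        for n_fft, hop, n_mels in scales
    ]
    return float(np.mean(per_scale))
```

Small FFT sizes give fine time resolution and large ones give fine frequency resolution, so averaging over several scales penalizes errors that a single spectrogram setting would miss.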

Core Capabilities

  • High-quality audio synthesis from mel spectrograms
  • Fast inference with optional CUDA kernel optimization
  • Support for both CPU and GPU execution
  • Handles varied audio types, including speech, environmental sounds, and instruments
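A minimal loading and synthesis sketch for the capabilities above, following the usage pattern published with NVIDIA's BigVGAN repository. It assumes the repository's `bigvgan` module and PyTorch are installed; `pick_device`, `load_vocoder`, and `synthesize` are hypothetical helper names introduced here, and the calls should be verified against the version you install.

```python
MODEL_ID = "nvidia/bigvgan_v2_22khz_80band_256x"


def pick_device(prefer_cuda: bool, cuda_available: bool) -> str:
    """Choose the GPU only when it is both requested and available."""
    return "cuda" if prefer_cuda and cuda_available else "cpu"


def load_vocoder(device: str = "cpu", use_cuda_kernel: bool = False):
    """Load the pretrained vocoder (downloads weights on first use)."""
    # Imported lazily so this file can be imported without the dependency.
    import bigvgan  # module from NVIDIA's BigVGAN repository (assumed installed)

    model = bigvgan.BigVGAN.from_pretrained(
        MODEL_ID, use_cuda_kernel=use_cuda_kernel
    )
    # Weight normalization is only needed for training.
    model.remove_weight_norm()
    return model.eval().to(device)


def synthesize(model, mel):
    """Generate a waveform from a mel tensor of shape [B, 80, T_frame].

    The output is expected to have shape [B, 1, T_frame * 256].
    """
    import torch

    with torch.inference_mode():
        return model(mel)


if __name__ == "__main__":
    import torch

    device = pick_device(prefer_cuda=True,
                         cuda_available=torch.cuda.is_available())
    vocoder = load_vocoder(device=device, use_cuda_kernel=False)
    # Placeholder input for a shape check only; real use needs a mel
    # spectrogram computed with the model's own configuration.
    dummy_mel = torch.zeros(1, 80, 100, device=device)
    print(synthesize(vocoder, dummy_mel).shape)
```

Leaving `use_cuda_kernel=False` keeps the sketch portable across CPU and GPU; the custom kernel is an opt-in optimization for CUDA devices.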

Frequently Asked Questions

Q: What makes this model unique?

Three things set this model apart: an optional custom CUDA kernel for faster inference, training on diverse audio (speech, environmental sounds, and instruments), and a multi-scale sub-band CQT discriminator used during training. This combination of output quality and inference speed makes it a practical choice for production environments.

Q: What are the recommended use cases?

The model is well suited to text-to-speech systems, audio content generation, and other applications that require high-quality voice synthesis. Its optimized inference path also makes it a good fit for applications that need real-time audio generation.
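Whether a deployment qualifies as real-time can be checked by measuring the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where a value below 1 means faster than real time. The sketch below is a generic timing helper, not part of BigVGAN. It assumes a 22,050 Hz sampling rate and a hop of 256 samples per mel frame, and `synthesize` stands for any callable that runs the vocoder.

```python
import time


def real_time_factor(elapsed_seconds: float, n_samples: int,
                     sampling_rate_hz: int = 22050) -> float:
    """Synthesis time divided by audio duration; below 1.0 is real-time."""
    audio_seconds = n_samples / sampling_rate_hz
    return elapsed_seconds / audio_seconds


def measure_rtf(synthesize, mel, n_frames: int, hop: int = 256,
                sampling_rate_hz: int = 22050) -> float:
    """Time one call to synthesize(mel) and convert it to an RTF.

    On a GPU, work is queued asynchronously, so the callable should
    synchronize (for example with torch.cuda.synchronize()) before
    returning, or the measured time will be too small.
    """
    start = time.perf_counter()
    synthesize(mel)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, n_frames * hop, sampling_rate_hz)
```

For example, generating one second of audio (22,050 samples) in 0.5 seconds gives an RTF of 0.5. Averaging over several runs after a warm-up call gives a more stable estimate than a single measurement.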
