SNAC 24kHz Audio Codec
| Property | Value |
|---|---|
| Model Size | 19.8M parameters |
| License | MIT |
| Bitrate | 0.98 kbps |
| Sample Rate | 24 kHz |
| Architecture | Multi-Scale Neural Audio Codec |
What is snac_24khz?
SNAC (Multi-Scale Neural Audio Codec) is an audio compression model aimed at speech synthesis applications. It compresses audio into a hierarchy of discrete tokens and maintains high quality at a low bitrate (0.98 kbps for this model).
Implementation Details
The model uses 3 RVQ (Residual Vector Quantization) levels that operate at different temporal resolutions, with token rates of 12, 23, and 47 Hz. Because the coarser levels emit fewer tokens per second, the total bitrate stays low while audio quality is preserved.
- Supports single-channel (mono) audio processing
- Implements hierarchical token compression similar to SoundStream and EnCodec
- Samples coarse tokens at lower rates than fine tokens
- Compresses audio to a bitrate of 0.98 kbps
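The listed bitrate can be checked against the three token rates with a little arithmetic. The sketch below assumes each RVQ level uses a 4096-entry codebook (12 bits per token). That codebook size is not stated in this card; it is an assumption chosen because it reproduces the 0.98 kbps figure.

```python
import math

# Token rates per RVQ level, in tokens per second (from the description above).
TOKEN_RATES_HZ = (12, 23, 47)

# Assumption: 4096 codebook entries per level, i.e. 12 bits per token.
# This value is not given in the card; it is picked to match the listed bitrate.
CODEBOOK_SIZE = 4096


def bitrate_bps(token_rates_hz, codebook_size):
    """Return the bitrate in bits per second for fixed-rate token streams."""
    bits_per_token = math.log2(codebook_size)
    return sum(token_rates_hz) * bits_per_token


total = bitrate_bps(TOKEN_RATES_HZ, CODEBOOK_SIZE)
print(f"{sum(TOKEN_RATES_HZ)} tokens/s -> {total / 1000:.2f} kbps")
```

Under that assumption, 82 tokens per second at 12 bits each gives 984 bits per second, which rounds to the 0.98 kbps in the table.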
Core Capabilities
- High-quality speech audio compression
- Efficient encoding and decoding of 24 kHz audio
- Variable temporal resolution processing
- PyTorch-based implementation with CUDA support
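An encode/decode round trip might look like the following minimal sketch. It assumes the `snac` Python package and PyTorch are installed, and that the weights are published under the Hugging Face repository id `hubertsiuzdak/snac_24khz`. That id is not given in this card, so treat it as an assumption. The helper name `roundtrip` is hypothetical, and random noise stands in for real audio.

```python
def roundtrip(num_samples=24000, repo_id="hubertsiuzdak/snac_24khz"):
    """Encode and decode a placeholder mono clip; return (codes, reconstruction)."""
    # Imported lazily so this file can be loaded without the optional dependencies.
    import torch
    from snac import SNAC

    # Use CUDA when available, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SNAC.from_pretrained(repo_id).eval().to(device)

    # Placeholder for real audio: shape (batch, channels=1, samples) at 24 kHz.
    audio = torch.randn(1, 1, num_samples, device=device)

    with torch.inference_mode():
        codes = model.encode(audio)      # list of code tensors, one per RVQ level
        audio_hat = model.decode(codes)  # reconstructed waveform
    return codes, audio_hat
```

Calling `roundtrip()` downloads the weights on first use. The returned `codes` list holds sequences of different lengths, one per temporal resolution.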
Frequently Asked Questions
Q: What makes this model unique?
SNAC's distinctive feature is its multi-scale design: coarse tokens are sampled less frequently than fine tokens, so each coarse token covers a broader time span. This reduces the number of tokens needed per second of audio while maintaining audio quality.
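To make the time spans concrete, the sketch below works out how many tokens each level produces for a clip and how much audio one token covers, using the nominal token rates from this card. The level names `coarse`, `medium`, and `fine` are labels chosen here for illustration, not identifiers from the model.

```python
# Nominal token rates in Hz, as listed in this card.
# The level names are illustrative labels only.
TOKEN_RATES_HZ = {"coarse": 12, "medium": 23, "fine": 47}


def tokens_per_level(duration_s):
    """Return the approximate number of tokens each level emits for a clip."""
    return {name: round(rate * duration_s) for name, rate in TOKEN_RATES_HZ.items()}


def span_ms(rate_hz):
    """Return the time span, in milliseconds, covered by one token."""
    return 1000.0 / rate_hz


counts = tokens_per_level(10.0)
for name, rate in TOKEN_RATES_HZ.items():
    print(f"{name}: {counts[name]} tokens in 10 s, {span_ms(rate):.1f} ms per token")
```

At these rates a coarse token spans roughly 83 ms of audio and a fine token roughly 21 ms, so the coarse level describes the signal with about a quarter as many tokens as the fine level.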
Q: What are the recommended use cases?
This model is intended primarily for speech synthesis. It can process other audio types, but it performs best on speech because its training focused on speech data.