EnCodec 24kHz Neural Audio Codec
Property | Value |
---|---|
Parameters | 23.3M |
Model Type | Audio Codec |
Author | Meta AI |
Paper | High Fidelity Neural Audio Compression |
Tensor Type | F32 |
What is encodec_24khz?
EnCodec 24kHz is a neural audio codec developed by Meta AI for real-time audio compression and decompression. It uses a streaming encoder-decoder architecture with a quantized latent space and is trained end-to-end. Training relies on a single multiscale spectrogram adversary to reduce artifacts and improve audio quality.
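To make the idea of a quantized latent space concrete, here is a minimal sketch of residual vector quantization, in which each codebook encodes whatever error the previous one left behind. It is a toy illustration only: the codebooks are random rather than learned, the helper names (`rvq_encode`, `rvq_decode`, `toy_codebooks`) are hypothetical, and nothing here is taken from the EnCodec implementation.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec with a stack of codebooks; each one codes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Rebuild the vector by summing the selected entry from each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def sq_error(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_codebooks(num_books, size, dim, seed=0):
    """Random stand-in codebooks (assumption: a real codec learns these)."""
    rng = random.Random(seed)
    books = []
    for level in range(num_books):
        scale = 0.5 ** level  # later codebooks cover finer detail
        entries = [[0.0] * dim]  # a zero entry means a stage can never add error
        entries += [
            [rng.uniform(-scale, scale) for _ in range(dim)]
            for _ in range(size - 1)
        ]
        books.append(entries)
    return books

codebooks = toy_codebooks(num_books=4, size=16, dim=4)
latent = [0.3, -0.7, 0.2, 0.5]
codes = rvq_encode(latent, codebooks)

# Reconstruction error when decoding with only the first n codebooks.
errors = [
    sq_error(latent, rvq_decode(codes[:n], codebooks[:n]))
    for n in range(1, len(codebooks) + 1)
]
print(codes, errors)
```

Decoding with only the first few codes still gives a coarser reconstruction, which is how a single model of this kind can serve several bitrates.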
Implementation Details
The model was trained on a mix of speech, general-audio, and music datasets: DNS Challenge 4, Common Voice, AudioSet, FSD50K, and Jamendo. Training ran for 300 epochs on 8 A100 GPUs, using the Adam optimizer with a batch size of 64 examples.
- Supports both streamable and non-streamable configurations
- Operates at various bandwidths (1.5, 3, 6, and 12 kbps)
- Includes weight normalization for convolution layers
- Features a novel loss balancer mechanism for training stability
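The bandwidths listed above can be related to the number of quantizer codebooks with some simple arithmetic. The sketch below assumes a total encoder stride of 320 samples and 1024-entry codebooks; both figures are assumptions not stated in this card, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000  # from the model name
HOP_LENGTH = 320         # assumption: total encoder stride in samples
CODEBOOK_SIZE = 1024     # assumption: entries per codebook

def frames_per_second(sample_rate=SAMPLE_RATE_HZ, hop=HOP_LENGTH):
    """Number of latent frames produced per second of audio."""
    return sample_rate // hop

def bits_per_code(codebook_size=CODEBOOK_SIZE):
    """Bits needed to index one entry of a codebook."""
    return (codebook_size - 1).bit_length()

def codebooks_for_bandwidth(kbps):
    """How many codebooks fit within a target bandwidth given in kbps."""
    bps_per_codebook = frames_per_second() * bits_per_code()
    return int(round(kbps * 1000)) // bps_per_codebook

table = {kbps: codebooks_for_bandwidth(kbps) for kbps in (1.5, 3, 6, 12)}
print(table)  # {1.5: 2, 3: 4, 6: 8, 12: 16}
```

Under these assumptions each codebook costs 750 bits per second, so doubling the bandwidth doubles the number of codebooks in use.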
Core Capabilities
- Real-time audio compression and decompression
- High-fidelity audio reproduction
- Further bandwidth reduction of 25-40% when a small Transformer language model is used for entropy coding
- Support for both speech and music processing
- Part of a model family that covers more than one sampling rate (this checkpoint runs at 24 kHz; a separate 48 kHz variant exists)
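To put the bitrates and the 25-40% language-model saving in perspective, the following sketch compares them with uncompressed audio. The baseline of 16-bit mono PCM at 24 kHz is an assumption chosen for illustration, and the helper names are hypothetical.

```python
def pcm_kbps(sample_rate_hz=24_000, bit_depth=16, channels=1):
    """Bitrate of uncompressed PCM audio in kbps (assumed baseline)."""
    return sample_rate_hz * bit_depth * channels / 1000

def compression_ratio(codec_kbps, raw_kbps):
    """How many times smaller the coded stream is than the raw stream."""
    return raw_kbps / codec_kbps

def with_entropy_coding(codec_kbps, reduction):
    """Bitrate after a fractional reduction, e.g. 0.25 for 25%."""
    return codec_kbps * (1.0 - reduction)

raw = pcm_kbps()  # 384.0 kbps for 24 kHz, 16-bit, mono
ratios = {kbps: compression_ratio(kbps, raw) for kbps in (1.5, 3, 6, 12)}
print(ratios)  # {1.5: 256.0, 3: 128.0, 6: 64.0, 12: 32.0}

# A 6 kbps stream after a 25% and a 40% reduction: roughly 4.5 and 3.6 kbps.
low, high = with_entropy_coding(6, 0.25), with_entropy_coding(6, 0.40)
print(round(low, 2), round(high, 2))
```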
Frequently Asked Questions
Q: What makes this model unique?
EnCodec stands out for combining real-time operation with high audio quality. In the evaluations reported in the paper, it outperforms baselines such as Lyra-v2 and Opus, and at 3 kbps it is rated better than Opus at 12 kbps.
Q: What are the recommended use cases?
The model is well suited to real-time audio compression, speech generation, music streaming, and text-to-speech tasks. It can be used directly as a codec or fine-tuned as a component of a larger audio processing pipeline.