EnCodec 24kHz Neural Audio Codec

Property	Value
Parameters	23.3M
Model Type	Audio Codec
Author	Meta AI
Paper	High Fidelity Neural Audio Compression
Tensor Type	F32

What is encodec_24khz?

EnCodec 24kHz is a neural audio codec from Meta AI that compresses and decompresses audio in real time. It uses a streaming encoder-decoder architecture with a quantized latent space, and the whole system is trained end-to-end. Training relies on a single multiscale spectrogram adversary, which reduces artifacts and improves the quality of the decoded audio.
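EnCodec's quantized latent space is produced by residual vector quantization (RVQ): each codebook quantizes the error left by the previous ones, so keeping fewer codebooks lowers the bitrate at the cost of fidelity. The following is a toy sketch in plain Python, not EnCodec's code. It assumes random (hypothetical) codebooks instead of trained ones and 4-dimensional vectors, and the helper names `nearest`, `rvq_encode` and `rvq_decode` are illustrative only.

```python
import random

def nearest(codebook, vec):
    """Index of the codeword with the smallest squared L2 distance to vec."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage codes the residual left so far."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Sum the chosen codewords; passing fewer indices gives a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

random.seed(0)
DIM, SIZE, STAGES = 4, 16, 4
# Hypothetical codebooks: stage s holds the zero vector plus random codewords
# of scale 0.5**s, so later stages refine progressively smaller residuals.
codebooks = [
    [[0.0] * DIM] + [[random.gauss(0, 0.5 ** s) for _ in range(DIM)] for _ in range(SIZE - 1)]
    for s in range(STAGES)
]

latent = [random.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(codebooks, latent)
errors = []
for k in range(1, STAGES + 1):
    # Decode with only the first k codebooks, as a lower bandwidth would.
    approx = rvq_decode(codebooks, codes[:k])
    errors.append(sum((a - b) ** 2 for a, b in zip(latent, approx)))
    print(f"{k} codebook(s): squared error {errors[-1]:.4f}")
```

Putting the zero vector in every codebook guarantees that an extra stage never increases the error in this toy; that is a convenience of the sketch, not a property claimed for EnCodec's trained quantizer.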

Implementation Details

The model was trained on a mix of datasets: DNS Challenge 4, Common Voice, AudioSet, FSD50K, and Jamendo. Training ran for 300 epochs on 8 A100 GPUs, using the Adam optimizer with a batch size of 64 examples.

Supports both streamable and non-streamable configurations
Operates at various bandwidths (1.5, 3, 6, and 12 kbps)
Includes weight normalization for convolution layers
Features a novel loss balancer mechanism for training stability
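The listed bandwidths correspond to how many RVQ codebooks are kept. The arithmetic below is a sketch that relies on figures from the EnCodec paper rather than from this card, so treat them as assumptions: mono 24 kHz input, an encoder stride of 320 samples (75 latent frames per second), and 1024-entry codebooks (10 bits per index). Under those assumptions each codebook costs 750 bit/s. Uncompressed 16-bit PCM is included only as a reference point.

```python
import math

SAMPLE_RATE = 24_000   # Hz, mono input (assumed)
HOP_LENGTH = 320       # samples per latent frame, i.e. encoder stride (assumed)
CODEBOOK_SIZE = 1024   # entries per RVQ codebook (assumed)

FRAME_RATE = SAMPLE_RATE / HOP_LENGTH          # latent frames per second
BITS_PER_CODE = math.log2(CODEBOOK_SIZE)       # bits needed for one codebook index
BPS_PER_CODEBOOK = FRAME_RATE * BITS_PER_CODE  # bit/s contributed by one codebook

RAW_PCM_KBPS = SAMPLE_RATE * 16 / 1000         # 16-bit mono PCM reference, in kbps

def codebooks_for(kbps):
    """Number of RVQ codebooks needed to reach a target bandwidth in kbps."""
    return round(kbps * 1000 / BPS_PER_CODEBOOK)

for kbps in (1.5, 3, 6, 12):
    n = codebooks_for(kbps)
    ratio = RAW_PCM_KBPS / kbps
    print(f"{kbps:>4} kbps -> {n:2d} codebooks, {ratio:.0f}x smaller than 16-bit PCM")
```

With these assumptions, 1.5, 3, 6 and 12 kbps map to 2, 4, 8 and 16 codebooks, which is 256x down to 32x below 384 kbps PCM.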

Core Capabilities

Real-time audio compression and decompression
High-fidelity audio reproduction
A further 25-40% bandwidth reduction when a lightweight Transformer language model is used to entropy-code the quantized output
Support for both speech and music processing
Multiple sampling rates across the EnCodec family (24 kHz and 48 kHz models)
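The extra bandwidth reduction comes from entropy coding: a language model assigns a probability p to each quantized token, and an arithmetic coder then spends about -log2 p bits on it instead of a fixed 10 bits (assuming 1024-entry codebooks). The sketch below illustrates only that principle. It swaps EnCodec's Transformer for a much simpler adaptive, Laplace-smoothed unigram model, and the token stream is a hypothetical Zipf-like sample, so the saving it prints says nothing about EnCodec's actual 25-40% figure.

```python
import math
import random
from collections import Counter

CODEBOOK_SIZE = 1024                   # assumed entries per codebook
FIXED_BITS = math.log2(CODEBOOK_SIZE)  # bits/token when indices are sent as-is

def adaptive_code_length(tokens, vocab_size):
    """Ideal bits/token for an arithmetic coder driven by an adaptive,
    Laplace-smoothed unigram model (a stand-in for a Transformer LM).
    Encoder and decoder update identical counts, so no side info is sent."""
    counts = Counter()
    total_bits = 0.0
    for seen, tok in enumerate(tokens):
        # Probability assigned to tok before it is observed.
        p = (counts[tok] + 1) / (seen + vocab_size)
        total_bits += -math.log2(p)
        counts[tok] += 1
    return total_bits / len(tokens)

random.seed(0)
# Hypothetical skewed token stream standing in for real codec tokens.
weights = [1 / (rank + 1) for rank in range(CODEBOOK_SIZE)]
tokens = random.choices(range(CODEBOOK_SIZE), weights=weights, k=20_000)

bits = adaptive_code_length(tokens, CODEBOOK_SIZE)
saving = 1 - bits / FIXED_BITS
print(f"{bits:.2f} bits/token vs {FIXED_BITS:.0f} fixed; {saving:.0%} smaller")
```

The saving exists only because the token distribution is non-uniform; on uniformly random tokens the adaptive model cannot beat the fixed 10 bits.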

Frequently Asked Questions

Q: What makes this model unique?

EnCodec combines real-time performance with high audio quality, consistently outperforming baselines such as Lyra-v2 and Opus at the same bandwidth. On average, EnCodec at 3 kbps performs better than Opus at 12 kbps, a quarter of the bitrate, which makes it highly efficient.

Q: What are the recommended use cases?

The model suits real-time audio compression, speech generation, music streaming, and text-to-speech. It can be used as-is or fine-tuned for specific audio-processing needs as a component of larger pipelines.
