NVIDIA TitaNet-Large Speaker Verification Model
| Property | Value |
|---|---|
| Parameter Count | 23M |
| License | CC-BY-4.0 |
| Language | English |
| Performance | 0.66% EER on VoxCeleb1 |
What is speakerverification_en_titanet_large?
TitaNet-Large is a speaker verification model developed by NVIDIA, built on a depth-wise separable Conv1D architecture. It is designed to extract speaker embeddings from speech and serves as a backbone for speaker verification and diarization tasks. With 23M parameters, it is the "large" variant of the TitaNet architecture family.
Implementation Details
The model is implemented with the NVIDIA NeMo toolkit and operates on 16 kHz mono-channel audio input. It uses depth-wise separable convolutions with global context to generate speaker embeddings, and was trained on a combination of datasets including VoxCeleb-1, VoxCeleb-2, Fisher, Switchboard, LibriSpeech, and SRE. A minimal loading example follows the performance notes below.
- Achieves 0.66% EER on the VoxCeleb1 cleaned trial file
- Demonstrates strong diarization performance, with DER as low as 1.19% on the CH109 dataset
- Supports both telephonic and non-telephonic speech
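The following is a minimal loading sketch using NeMo's `EncDecSpeakerLabelModel`; the audio file path is an illustrative placeholder rather than a file shipped with the model.

```python
# Minimal sketch: load the pretrained TitaNet-Large checkpoint through NeMo
# and extract a speaker embedding from a 16 kHz mono-channel WAV file.
# Assumes nemo_toolkit[asr] is installed; the file path is a placeholder.
import nemo.collections.asr as nemo_asr

speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
    model_name="titanet_large"
)

# get_embedding expects a path to a 16 kHz, mono-channel audio file.
embedding = speaker_model.get_embedding("sample_16khz_mono.wav")
print(embedding.shape)  # speaker embedding tensor for the utterance
```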
Core Capabilities
- Speaker embedding extraction from audio files
- Speaker verification between two utterances (see the usage sketch after this list)
- Batch processing for multiple audio files
- Support for speaker diarization tasks
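Under the same assumptions as the loading sketch above (NeMo installed, placeholder file paths), pairwise verification and simple batch embedding extraction might look like this:

```python
# Sketch: pairwise speaker verification and batch embedding extraction.
# All audio paths below are placeholders.
import nemo.collections.asr as nemo_asr

speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
    model_name="titanet_large"
)

# Verification between two utterances: compares the similarity of the two
# embeddings against a decision threshold and returns True/False.
same_speaker = speaker_model.verify_speakers(
    "speaker1_utt1.wav", "speaker1_utt2.wav"
)
print("Same speaker:", same_speaker)

# Simple batch processing: extract one embedding per file.
audio_files = ["speaker1_utt1.wav", "speaker2_utt1.wav", "speaker3_utt1.wav"]
embeddings = [speaker_model.get_embedding(path) for path in audio_files]
```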
Frequently Asked Questions
Q: What makes this model unique?
The combination of depth-wise separable convolutions with global context, together with training on six diverse datasets, makes the model particularly robust for speaker verification tasks. Its performance metrics, especially the 0.66% EER on VoxCeleb1, demonstrate state-of-the-art capabilities.
Q: What are the recommended use cases?
The model is well suited to speaker verification in security systems, speaker diarization for meeting transcription, and voice-based authentication. It performs well in both telephonic and non-telephonic environments, though fine-tuning may be necessary for specific domains.