diar_sortformer_4spk-v1
| Property | Value |
|---|---|
| Author | NVIDIA |
| Architecture | Sortformer with NEST Encoder |
| License | CC-BY-NC-4.0 |
| Paper | Sortformer: Seamless Integration of Speaker Diarization and ASR |
What is diar_sortformer_4spk-v1?
diar_sortformer_4spk-v1 is an end-to-end neural speaker diarization model developed by NVIDIA. It resolves the speaker permutation problem by assigning output speaker labels in arrival-time order, that is, in the order in which each speaker first speaks. The model handles up to 4 speakers per recording and takes single-channel audio sampled at 16 kHz as input.
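The arrival-time idea can be illustrated with a small NumPy sketch. This is not NVIDIA's implementation: the function name `sort_by_arrival_time` and the toy label matrix are hypothetical, and the sketch only shows how speaker columns can be put into a canonical order so that "speaker 0" always means "the first person to speak".

```python
import numpy as np

def sort_by_arrival_time(activity: np.ndarray) -> np.ndarray:
    """Reorder speaker columns so that earlier-starting speakers come first.

    activity: (num_frames, num_speakers) matrix, nonzero where a speaker is active.
    Speakers that never speak are placed last.
    """
    num_frames = activity.shape[0]
    active = activity > 0
    # Index of the first active frame per speaker; silent speakers get
    # num_frames so that they sort after every speaker who does talk.
    first_frame = np.where(active.any(axis=0), active.argmax(axis=0), num_frames)
    order = np.argsort(first_frame, kind="stable")
    return activity[:, order]

# Toy labels: column 0 starts at frame 2, column 1 at frame 0, column 2 is silent.
labels = np.array([
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
])
sorted_labels = sort_by_arrival_time(labels)
```

After sorting, the speaker who started at frame 0 occupies column 0 and the silent speaker stays last, so any two annotations of the same recording map to the same column order.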
Implementation Details
The model architecture consists of an 18-layer NeMo Encoder for Speech Tasks (NEST) based on Fast-Conformer, followed by an 18-layer Transformer encoder with a hidden size of 192. The output stage consists of two feedforward layers that produce 4 sigmoid outputs per frame, one per speaker. Each output frame covers 0.08 seconds, so the model emits per-frame speaker activity probabilities for up to 4 speakers.
- Trained on 2030 hours of real conversations and 5150 hours of simulated mixtures
- Achieves a diarization error rate (DER) of 14.76% on DIHARD3-Eval with optimized post-processing
- Processes audio 437 to 1053 times faster than real time (RTFx) on an RTX A6000
- Supports a maximum recording duration of about 12 minutes on a 48 GB GPU
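To make the output format concrete, the sketch below turns a frames-by-speakers probability matrix into time-stamped segments using the 0.08-second frame length stated above. The fixed 0.5 threshold and the helper name `probs_to_segments` are assumptions for illustration. The optimized post-processing mentioned above is a separate, tuned step that this sketch does not reproduce.

```python
import numpy as np

FRAME_SECONDS = 0.08  # frame duration stated in the text

def probs_to_segments(probs, threshold=0.5):
    """Convert (num_frames, num_speakers) probabilities to (start, end, speaker) tuples."""
    probs = np.asarray(probs)
    segments = []
    for spk in range(probs.shape[1]):
        active = probs[:, spk] > threshold
        # Pad with False on both sides so every active run has a start and an end edge.
        padded = np.concatenate(([False], active, [False]))
        edges = np.flatnonzero(padded[1:] != padded[:-1])
        # Edges alternate: run start (inclusive frame), run end (exclusive frame).
        for start, end in zip(edges[::2], edges[1::2]):
            segments.append((
                round(float(start * FRAME_SECONDS), 2),
                round(float(end * FRAME_SECONDS), 2),
                spk,
            ))
    return sorted(segments)

# Toy output: 5 frames, 4 speaker channels; only the first two speakers talk.
probs = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.7, 0.6, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
])
segments = probs_to_segments(probs)
```

Because the four outputs are independent sigmoids, overlapping speech appears naturally as two speakers active in the same frame (frame 2 in the toy input). At 0.08 seconds per frame, a 12-minute recording corresponds to 720 / 0.08 = 9000 frames.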
Core Capabilities
- Speaker diarization for up to 4 speakers
- Frame-level speaker activity probabilities
- Optimized post-processing options
- Integration with NVIDIA NeMo framework
- Support for various input formats including single files and batch processing
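A minimal sketch of the NeMo integration follows. It assumes NeMo with ASR support is installed, that the model is published under the identifier `nvidia/diar_sortformer_4spk-v1`, and that `SortformerEncLabelModel.diarize` accepts either one path or a list of paths. None of these details appear in the text above, so check them against the NeMo documentation before relying on them. The wrapper name `diarize_files` and the file paths are placeholders.

```python
def diarize_files(audio_paths, batch_size=1):
    """Run the pretrained diarizer on one audio path or a list of paths."""
    # Imported lazily so this sketch can be loaded without NeMo installed.
    from nemo.collections.asr.models import SortformerEncLabelModel

    # Assumed model identifier; downloads the checkpoint on first use.
    model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
    model.eval()
    # Returns the predicted speaker segments for each input recording.
    return model.diarize(audio=audio_paths, batch_size=batch_size)

# Example calls (placeholder paths, not executed here):
#   diarize_files("/path/to/meeting.wav")
#   diarize_files(["/path/to/a.wav", "/path/to/b.wav"], batch_size=2)
```

Inputs should be single-channel 16 kHz recordings, and long files may need to be split to stay within GPU memory.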
Frequently Asked Questions
Q: What makes this model unique?
Two things set the model apart: it is trained end to end, and it resolves the speaker permutation problem by assigning speaker labels in the arrival-time order of speech segments. With optimized post-processing, it reaches a DER of 14.76% on DIHARD3-Eval.
Q: What are the recommended use cases?
The model is suited to offline speaker diarization of recordings with up to 4 speakers, such as meetings, interviews, and phone calls. It works best on English-language content recorded in clean audio conditions.