diar_sortformer_4spk-v1

Property	Value
Author	NVIDIA
Architecture	Sortformer with NEST Encoder
License	CC-BY-NC-4.0
Paper	Sortformer: Seamless Integration of Speaker Diarization and ASR

What is diar_sortformer_4spk-v1?

diar_sortformer_4spk-v1 is an end-to-end neural speaker diarization model developed by NVIDIA. It resolves the speaker permutation problem by following the arrival-time order of each speaker's speech segments. The model handles up to 4 speakers per conversation and expects single-channel (mono) audio sampled at 16 kHz.
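To make the arrival-time ordering idea concrete, here is a minimal Python sketch. It is not the model's training procedure; it only shows how per-speaker activity tracks can be placed in a canonical order by the frame at which each speaker first becomes active. The function names and the 0.5 threshold are assumptions chosen for illustration.

```python
from typing import List


def first_active_frame(track: List[float], threshold: float = 0.5) -> float:
    """Return the index of the first frame at or above the threshold.

    Returns infinity for a speaker who never becomes active, so that
    silent tracks sort last.
    """
    for i, p in enumerate(track):
        if p >= threshold:
            return i
    return float("inf")


def sort_by_arrival(tracks: List[List[float]], threshold: float = 0.5) -> List[List[float]]:
    """Reorder per-speaker tracks so the earliest-arriving speaker comes first."""
    return sorted(tracks, key=lambda t: first_active_frame(t, threshold))


# Hypothetical activity tracks for 4 speakers over 4 frames.
tracks = [
    [0.0, 0.1, 0.9, 0.8],  # first active at frame 2
    [0.9, 0.9, 0.1, 0.0],  # first active at frame 0
    [0.0, 0.0, 0.0, 0.0],  # never active
    [0.1, 0.7, 0.6, 0.2],  # first active at frame 1
]
ordered = sort_by_arrival(tracks)
```

With a fixed ordering like this, each output channel has exactly one valid target, so there is no ambiguity about which channel belongs to which speaker.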

Implementation Details

The architecture consists of an 18-layer NeMo Encoder for Speech Tasks (NEST), which is based on Fast-Conformer, followed by an 18-layer Transformer encoder with a hidden size of 192. Two feedforward layers on top produce 4 sigmoid outputs per input frame. Each frame covers 0.08 seconds of audio, and each output is the activity probability of one of up to 4 speakers.
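The frame-level output described above can be turned into time-stamped segments with simple thresholding. The following Python sketch assumes a frames-by-speakers matrix of probabilities and a fixed 0.5 threshold. The function name and the threshold are illustrative assumptions, not the post-processing shipped with the model.

```python
from typing import List, Tuple

FRAME_SEC = 0.08  # frame length stated in the text


def probs_to_segments(
    probs: List[List[float]], threshold: float = 0.5
) -> List[Tuple[float, float, int]]:
    """Convert a frames-by-speakers matrix into (start_sec, end_sec, speaker) tuples."""
    segments: List[Tuple[float, float, int]] = []
    if not probs:
        return segments
    num_speakers = len(probs[0])
    for spk in range(num_speakers):
        start = None  # frame index where the current active run began
        for t, frame in enumerate(probs):
            active = frame[spk] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((round(start * FRAME_SEC, 2), round(t * FRAME_SEC, 2), spk))
                start = None
        if start is not None:
            # Close a run that is still active at the end of the recording.
            end = round(len(probs) * FRAME_SEC, 2)
            segments.append((round(start * FRAME_SEC, 2), end, spk))
    return sorted(segments)


# Hypothetical output: 5 frames, 4 speakers; speakers 0 and 1 overlap in the last frame.
probs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.2, 0.9, 0.0, 0.0],
    [0.1, 0.8, 0.0, 0.0],
    [0.7, 0.6, 0.0, 0.0],
]
segments = probs_to_segments(probs)
```

Because each speaker has an independent sigmoid output, two speakers can be active in the same frame, which is how overlapping speech is represented.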

Trained on 2030 hours of real conversations and 5150 hours of simulated mixtures
Achieves DER of 14.76% on DIHARD3-Eval with optimized post-processing
Processes audio 437x to 1053x faster than real time on an RTX A6000
Supports recordings of up to ~12 minutes on a 48 GB GPU
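As a rough back-of-the-envelope check, the figures above can be combined in a few lines of Python. This sketch assumes the reported speedups also apply to a recording of the maximum length, which the source does not state, so the timings are estimates only.

```python
FRAME_SEC = 0.08                   # frame length from the implementation details
MAX_MINUTES = 12                   # approximate limit on a 48 GB GPU
RTFX_LOW, RTFX_HIGH = 437, 1053    # reported speedup range over real time

audio_sec = MAX_MINUTES * 60                 # length of the longest recording in seconds
num_frames = round(audio_sec / FRAME_SEC)    # output frames for that recording
slowest_sec = audio_sec / RTFX_LOW           # estimated processing time at the low end
fastest_sec = audio_sec / RTFX_HIGH          # estimated processing time at the high end

print(f"{num_frames} frames, {fastest_sec:.2f}-{slowest_sec:.2f} s to process")
```

Under these assumptions, a 12-minute recording yields 9000 output frames and is processed in well under two seconds.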

Core Capabilities

Speaker diarization for up to 4 speakers
Frame-level speaker activity probabilities
Optimized post-processing options
Integration with NVIDIA NeMo framework
Support for single-file and batch inputs

Frequently Asked Questions

Q: What makes this model unique?

The model is end-to-end and resolves the speaker permutation problem by following the arrival-time order of speech segments. With optimized post-processing, it reaches a DER of 14.76% on DIHARD3-Eval.

Q: What are the recommended use cases?

The model is suited to offline speaker diarization with up to 4 speakers, such as meetings, interviews, and phone calls. It works best on English-language content and clean audio.