diar_sortformer_4spk-v1

Sortformer model for speaker diarization supporting up to 4 speakers. It uses a Fast-Conformer-based encoder and reports a DER of 14.76% on the DIHARD3-Eval dataset with optimized post-processing.

Property      Value
Author        NVIDIA
Architecture  Sortformer with NEST Encoder
License       CC-BY-NC-4.0
Paper         Sortformer: Seamless Integration of Speaker Diarization and ASR

What is diar_sortformer_4spk-v1?

diar_sortformer_4spk-v1 is an end-to-end neural model for speaker diarization developed by NVIDIA. It resolves the speaker permutation problem by assigning speaker labels in the arrival-time order of speech segments, so the speaker who talks first takes the first output slot, the next new speaker takes the second, and so on. The model handles up to 4 speakers in a conversation and takes single-channel (mono) audio sampled at 16 kHz.
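The arrival-time ordering can be illustrated with a small sketch. This is not NVIDIA's implementation; it is a toy, pure-Python example that assumes frame-level 0/1 speaker labels and reorders the speaker columns so that the earliest speaker comes first. The function name `sort_speakers_by_arrival` is hypothetical.

```python
def sort_speakers_by_arrival(activity):
    """Reorder speaker columns so the earliest-starting speaker comes first.

    activity: list of frames, each a list of 0/1 flags (one per speaker).
    Speakers who never speak are placed last.
    """
    num_frames = len(activity)
    num_speakers = len(activity[0]) if activity else 0

    def first_active_frame(spk):
        # Index of the first frame in which this speaker is active.
        for t, frame in enumerate(activity):
            if frame[spk]:
                return t
        return num_frames  # silent speakers sort to the end

    order = sorted(range(num_speakers), key=first_active_frame)
    return [[frame[spk] for spk in order] for frame in activity]


# Toy labels: 6 frames, 3 speakers in arbitrary column order.
labels = [
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
]
sorted_labels = sort_speakers_by_arrival(labels)
for frame in sorted_labels:
    print(frame)
```

With the labels in a fixed, arrival-based order, there is only one valid target arrangement per recording, which is what removes the permutation ambiguity.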

Implementation Details

The model architecture consists of an 18-layer NeMo Encoder for Speech Tasks (NEST) based on Fast-Conformer, followed by an 18-layer Transformer encoder with a hidden size of 192. On top of these sit two feedforward layers with 4 sigmoid outputs for each input frame, one per speaker slot. Because the outputs are independent sigmoids, more than one speaker can be marked active in the same frame. Each output frame covers 0.08 seconds of audio and carries speaker activity probabilities for up to 4 speakers.
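To make the output format concrete, here is a minimal sketch of turning a frame-by-speaker probability matrix into time segments. It assumes a plain 0.5 threshold, which is an illustrative choice and not the model's tuned post-processing; `frames_to_segments` and its parameters are hypothetical names.

```python
FRAME_SEC = 0.08  # duration of one output frame, as stated above


def frames_to_segments(probs, threshold=0.5):
    """Convert per-frame speaker probabilities into segments.

    probs: list of frames, each a list with one probability per speaker.
    Returns a sorted list of (start_sec, end_sec, speaker_index) tuples.
    The threshold is an assumption made for this example.
    """
    segments = []
    if not probs:
        return segments
    num_speakers = len(probs[0])
    for spk in range(num_speakers):
        start = None  # frame index where the current segment began
        for t, frame in enumerate(probs):
            active = frame[spk] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append(
                    (round(start * FRAME_SEC, 2), round(t * FRAME_SEC, 2), spk)
                )
                start = None
        if start is not None:  # speaker still active at the last frame
            segments.append(
                (round(start * FRAME_SEC, 2), round(len(probs) * FRAME_SEC, 2), spk)
            )
    return sorted(segments)


# Five frames (0.4 s) with two of the four speaker slots in use.
probs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.7, 0.6, 0.0, 0.0],  # overlap: both speakers above threshold
    [0.2, 0.9, 0.0, 0.0],
    [0.1, 0.8, 0.0, 0.0],
]
segments = frames_to_segments(probs)
print(segments)  # [(0.0, 0.24, 0), (0.16, 0.4, 1)]
```

The two segments overlap between 0.16 s and 0.24 s, which is only representable because each speaker slot has its own sigmoid output.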

  • Trained on 2030 hours of real conversations and 5150 hours of simulated mixtures
  • Achieves DER of 14.76% on DIHARD3-Eval with optimized post-processing
  • Processes audio roughly 437 to 1053 times faster than real time (inverse real-time factor) on an RTX A6000
  • Supports a maximum recording duration of ~12 minutes on a 48 GB GPU
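For readers unfamiliar with the DER figure quoted above, the following is a simplified sketch of how diarization error rate can be computed at the frame level: missed speech, false alarms, and speaker confusion, divided by total reference speech. It assumes reference and hypothesis labels are already aligned to the same speaker order and uses no collar, so it is not the exact scoring protocol behind the reported 14.76%. The name `frame_level_der` is hypothetical.

```python
def frame_level_der(reference, hypothesis):
    """Simplified frame-level diarization error rate.

    Both arguments are lists of frames; each frame is a list of 0/1 flags,
    one per speaker, with speaker columns already aligned (an assumption).
    """
    missed = false_alarm = confusion = total_ref = 0
    for ref_frame, hyp_frame in zip(reference, hypothesis):
        n_ref = sum(ref_frame)  # speakers truly active in this frame
        n_hyp = sum(hyp_frame)  # speakers predicted active in this frame
        n_correct = sum(1 for r, h in zip(ref_frame, hyp_frame) if r and h)
        missed += max(0, n_ref - n_hyp)
        false_alarm += max(0, n_hyp - n_ref)
        confusion += min(n_ref, n_hyp) - n_correct
        total_ref += n_ref
    if total_ref == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total_ref


reference = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 0]]
hypothesis = [[1, 0], [0, 0], [1, 1], [1, 0], [0, 1]]
der = frame_level_der(reference, hypothesis)
print(f"DER = {der:.2%}")  # one miss, one confusion, one false alarm over 5 -> 60.00%
```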

Core Capabilities

  • Speaker diarization for up to 4 speakers
  • Frame-level speaker activity probabilities
  • Optimized post-processing options
  • Integration with NVIDIA NeMo framework
  • Support for various input formats including single files and batch processing
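A minimal sketch of calling the model through NeMo is shown below. It assumes the `SortformerEncLabelModel` class and its `diarize(audio=..., batch_size=...)` method as shown in NVIDIA's published usage for this model; verify both against the NeMo version you have installed. The helper names `as_audio_list` and `diarize_files` and the file paths are hypothetical.

```python
def as_audio_list(audio):
    """Accept a single path or an iterable of paths; always return a list."""
    return [audio] if isinstance(audio, str) else list(audio)


def diarize_files(audio, batch_size=1):
    """Run diarization on one file or a batch of files.

    Assumes the NeMo toolkit is installed and that the API below matches
    the installed NeMo version.
    """
    # Imported lazily so this module loads even without NeMo installed.
    from nemo.collections.asr.models import SortformerEncLabelModel

    model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
    model.eval()  # inference mode
    return model.diarize(audio=as_audio_list(audio), batch_size=batch_size)


# Single file and batch inputs are normalized the same way.
print(as_audio_list("meeting.wav"))
print(as_audio_list(["call_1.wav", "call_2.wav"]))
```

Calling `diarize_files("meeting.wav")` would download the checkpoint on first use; inputs should be mono, 16 kHz recordings within the duration limit noted earlier.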

Frequently Asked Questions

Q: What makes this model unique?

Two things set the model apart: it is end-to-end, and it resolves the speaker permutation problem by ordering speakers by the arrival time of their speech segments. With optimized post-processing, it reports a DER of 14.76% on DIHARD3-Eval.

Q: What are the recommended use cases?

The model is suited to offline speaker diarization in scenarios with up to 4 speakers, such as meetings, interviews, and phone calls. It works best with English-language content and clean audio conditions.
