diar_sortformer_4spk-v1

Maintained By
nvidia

Property      Value
Author        NVIDIA
Architecture  Sortformer with NEST Encoder
License       CC-BY-NC-4.0
Paper         Sortformer: Seamless Integration of Speaker Diarization and ASR

What is diar_sortformer_4spk-v1?

diar_sortformer_4spk-v1 is an end-to-end neural model for speaker diarization developed by NVIDIA. It resolves the speaker permutation problem by ordering speakers according to the arrival time of their speech segments. The model handles up to 4 speakers in a conversation and takes single-channel audio sampled at 16 kHz.
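The arrival-time idea can be illustrated with a small sketch. It is not NVIDIA's implementation, only a toy version of the principle: given a frame-by-speaker activity matrix, the speaker columns are reordered so that whoever speaks first occupies the first column. The function name `sort_speakers_by_arrival` and the example matrix are hypothetical.

```python
def sort_speakers_by_arrival(activity):
    """Reorder speaker columns so the earliest-starting speaker comes first.

    activity: list of frames, each a list of 0/1 flags (one per speaker).
    Returns a new list of frames with the columns permuted.
    """
    num_frames = len(activity)
    num_speakers = len(activity[0])

    def first_active_frame(spk):
        # Index of the first frame in which this speaker is active.
        for t, frame in enumerate(activity):
            if frame[spk]:
                return t
        return num_frames  # speakers who never talk sort last

    order = sorted(range(num_speakers), key=first_active_frame)
    return [[frame[s] for s in order] for frame in activity]


# Hypothetical labels: column 1 speaks first, then column 2, then column 0.
labels = [
    [0, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]
sorted_labels = sort_speakers_by_arrival(labels)
print(sorted_labels)
```

With a fixed ordering rule like this, each output column has a single well-defined meaning (first speaker to arrive, second speaker to arrive, and so on), so no permutation search over speaker assignments is needed.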

Implementation Details

The model architecture consists of an 18-layer NeMo Encoder for Speech Tasks (NEST) based on Fast-Conformer, followed by an 18-layer Transformer encoder with a hidden size of 192. The output layer includes two feedforward layers with 4 sigmoid outputs for each frame input. The model processes audio in 0.08-second frames and outputs speaker activity probabilities for up to 4 speakers.
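To make the output format concrete, the following sketch turns per-frame speaker probabilities into time segments, using the 0.08-second frame length stated above. The 0.5 threshold, the function name `probs_to_segments`, and the sample probabilities are assumptions for illustration; the model's actual post-processing is more elaborate and is not reproduced here.

```python
FRAME_SEC = 0.08  # frame length given in the model description


def probs_to_segments(probs, threshold=0.5):
    """Convert frame-level probabilities to (speaker, start_sec, end_sec).

    probs: list of frames, each holding one probability per speaker.
    threshold: assumed cutoff above which a speaker counts as active.
    """
    segments = []
    num_frames = len(probs)
    num_speakers = len(probs[0])
    for spk in range(num_speakers):
        start = None  # frame index where the current segment began
        for t, frame in enumerate(probs):
            active = frame[spk] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append(
                    (spk, round(start * FRAME_SEC, 2), round(t * FRAME_SEC, 2))
                )
                start = None
        if start is not None:  # segment runs to the end of the audio
            segments.append(
                (spk, round(start * FRAME_SEC, 2), round(num_frames * FRAME_SEC, 2))
            )
    return segments


# Four hypothetical frames with 4 sigmoid outputs each.
probs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
]
segments = probs_to_segments(probs)
print(segments)
```

Because the outputs are independent sigmoids rather than a softmax, two speakers can both exceed the threshold in the same frame, which is how overlapping speech would show up in this representation.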

  • Trained on 2030 hours of real conversations and 5150 hours of simulated mixtures
  • Achieves DER of 14.76% on DIHARD3-Eval with optimized post-processing
  • Processes audio 437 to 1053 times faster than real time on an RTX A6000
  • Supports maximum recording duration of ~12 minutes on 48GB GPU
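The figures above can be combined into a rough back-of-the-envelope calculation. The sketch assumes that the quoted factor means audio duration divided by processing time, and it treats the approximate 12-minute limit as exactly 12 minutes; both are assumptions, not statements from the model card.

```python
FRAME_SEC = 0.08           # frame length from the model description
NUM_SPEAKERS = 4           # sigmoid outputs per frame
MAX_DURATION_SEC = 12 * 60  # "~12 minutes", taken as exactly 12 here

# Number of output frames for a maximum-length recording.
num_frames = round(MAX_DURATION_SEC / FRAME_SEC)
# Total speaker-activity probabilities produced for that recording.
num_outputs = num_frames * NUM_SPEAKERS


def processing_time(audio_sec, speedup):
    """Seconds of compute, assuming speedup = audio time / compute time."""
    return audio_sec / speedup


slowest = processing_time(MAX_DURATION_SEC, 437)
fastest = processing_time(MAX_DURATION_SEC, 1053)

print(num_frames, num_outputs)
print(f"{fastest:.2f} s to {slowest:.2f} s")
```

Under these assumptions, a full-length recording yields 9000 frames and would take on the order of one to two seconds to process on the quoted hardware.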

Core Capabilities

  • Speaker diarization for up to 4 speakers
  • Frame-level speaker activity probabilities, one value per speaker for each 0.08-second frame
  • Optimized post-processing options
  • Integration with NVIDIA NeMo framework
  • Support for various input formats including single files and batch processing

Frequently Asked Questions

Q: What makes this model unique?

Two things set the model apart: its end-to-end architecture and its way of resolving the speaker permutation problem, which orders speakers by the arrival time of their speech segments. With optimized post-processing, it reaches a DER of 14.76% on DIHARD3-Eval.

Q: What are the recommended use cases?

The model is suited to offline speaker diarization in scenarios with up to 4 speakers, such as meetings, interviews, and phone calls. It works best with English-language content and clean audio conditions.
