parakeet-tdt_ctc-110m

nvidia

ASR model with roughly 110M parameters (114M as listed below) for English speech transcription. Supports punctuation and capitalization, reports an average RTFx of about 5300 on an A100 GPU, and can transcribe up to 20 minutes of audio in a single pass.

Property	Value
Parameter Count	114M
Model Type	ASR (Automatic Speech Recognition)
Architecture	Hybrid FastConformer TDT-CTC
License	CC-BY-4.0
Developer	NVIDIA NeMo and Suno.ai

What is parakeet-tdt_ctc-110m?

Parakeet TDT-CTC 110M is a speech recognition model developed jointly by the NVIDIA NeMo and Suno.ai teams. It transcribes English speech and produces punctuation and capitalization automatically. On an A100 GPU it reaches an average inverse real-time factor (RTFx) of approximately 5300, meaning it transcribes about 5300 seconds of audio per second of compute.
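The RTFx figure can be made concrete with a little arithmetic. The following is a minimal sketch: the function names are hypothetical, and the estimate assumes the benchmark average of about 5300 carries over to your hardware and batch setup, which it may not.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per second of compute."""
    if audio_seconds < 0 or processing_seconds <= 0:
        raise ValueError("audio_seconds must be >= 0 and processing_seconds > 0")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, rtfx_value: float = 5300.0) -> float:
    """Rough compute time for a given amount of audio at a given RTFx.

    The default of 5300 is the average quoted for an A100; treat the result
    as an order-of-magnitude estimate, not a latency guarantee.
    """
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be > 0")
    return audio_seconds / rtfx_value


# One hour of audio at RTFx 5300 takes well under a second of compute.
print(f"{estimated_processing_seconds(3600.0):.2f} s")  # prints "0.68 s"
```

RTFx is a throughput measure, so it says little about the latency of a single short request.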

Implementation Details

The model uses a Hybrid FastConformer architecture with full (global) attention, which lets it transcribe up to 20 minutes of audio in a single pass. It was trained on 36,000 hours of English speech: 27,000 hours of private data and 9,000 hours from public datasets, including LibriSpeech, Fisher Corpus, and VCTK.

Uses 8x depthwise-separable convolutional downsampling
Accepts 16kHz mono-channel audio input
Implements both TDT and CTC decoding options
Trained with the NeMo toolkit for 20,000 steps
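To see what 8x downsampling means for a 20-minute input, the sketch below counts approximate encoder frames. The 16 kHz sample rate and the 8x factor come from the list above; the 10 ms feature hop is an assumption (a common setting for this kind of front end, not confirmed here), and exact counts in a real implementation will differ slightly because of padding.

```python
SAMPLE_RATE_HZ = 16_000    # from the model description
DOWNSAMPLING_FACTOR = 8    # from the model description
HOP_SAMPLES = 160          # ASSUMPTION: 10 ms feature hop at 16 kHz


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def encoder_frames(audio_seconds: float) -> int:
    """Approximate number of encoder frames for a clip of the given length."""
    num_samples = round(audio_seconds * SAMPLE_RATE_HZ)
    feature_frames = ceil_div(num_samples, HOP_SAMPLES)
    return ceil_div(feature_frames, DOWNSAMPLING_FACTOR)


frames = encoder_frames(20 * 60)
print(frames)           # 15000 frames, i.e. one frame per 80 ms of audio
print(frames * frames)  # 225000000 pairwise attention scores per head
```

Because full attention compares every frame with every other frame, the cost grows quadratically with clip length, which is one reason a single-pass limit exists at all.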

Core Capabilities

High-speed transcription with RTFx of ~5300 on A100
Handles up to 20-minute audio segments
Automatic punctuation and capitalization
Low word error rate (WER) across multiple domains, for example 2.4% on LibriSpeech test-clean
BPE tokenizer with 1024 vocabulary size
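For readers unfamiliar with the WER metric quoted above, here is a minimal sketch of how it is computed: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function name is hypothetical, and this version does no text normalization, whereas published figures are normally computed after normalizing case and punctuation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 4))  # 0.1667
```

A WER of 2.4% therefore corresponds to roughly one word error in every 42 reference words.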

Frequently Asked Questions

Q: What makes this model unique?

The model combines high throughput (RTFx of about 5300 on an A100), single-pass transcription of up to 20 minutes of audio, and automatic punctuation and capitalization. That combination makes it well suited to production environments where both processing speed and accuracy matter.

Q: What are the recommended use cases?

The model is well suited to applications that need real-time or batch transcription of English speech, particularly when punctuation and capitalization are required. It is especially suitable for long-form content such as lectures, meetings, or interviews.
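Long-form recordings can exceed the 20-minute single-pass limit, so they need to be split first. The sketch below plans segment boundaries for such a recording. The function name and the fixed-boundary strategy are assumptions made for illustration; a real pipeline would more likely cut at pauses so that words are not split across segments.

```python
MAX_SEGMENT_SECONDS = 20 * 60  # single-pass limit stated for the model


def plan_segments(total_seconds: float,
                  max_segment_seconds: float = MAX_SEGMENT_SECONDS) -> list[tuple[float, float]]:
    """Return (start, end) times covering the recording in order.

    Every segment is at most max_segment_seconds long; the last one may be shorter.
    """
    if total_seconds < 0 or max_segment_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_segment_seconds > 0")
    segments = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_segment_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


# A 50-minute lecture becomes two full segments and one 10-minute remainder.
for start, end in plan_segments(50 * 60):
    print(f"{start:.0f}s to {end:.0f}s")
```

Each planned segment can then be cut from the source audio and transcribed separately, with the transcripts joined in order.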