Parakeet TDT-CTC 110M
| Property | Value |
|---|---|
| Parameter Count | 114M |
| Model Type | ASR (Automatic Speech Recognition) |
| Architecture | Hybrid FastConformer TDT-CTC |
| License | CC-BY-4.0 |
| Developer | NVIDIA NeMo and Suno.ai |
What is parakeet-tdt_ctc-110m?
Parakeet TDT-CTC 110M is an English speech recognition model developed jointly by the NVIDIA NeMo and Suno.ai teams. It transcribes English speech with automatic punctuation and capitalization. The model reaches an average inverse real-time factor (RTFx) of approximately 5300 on A100 GPUs, meaning it processes roughly 5300 seconds of audio per second of compute, which places it among the fastest ASR models available.
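To make the RTFx figure concrete, the following minimal sketch converts it into wall-clock processing time. The function names are illustrative and not part of any library. The result assumes the quoted average throughput holds for the workload at hand, which may not be the case for a single short file.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Wall-clock seconds needed to process the audio at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


# One hour of audio at the quoted average RTFx of ~5300.
one_hour = 3600.0
print(f"{processing_time(one_hour, 5300):.2f} s")  # prints "0.68 s"
```

At this rate, one hour of speech corresponds to well under a second of GPU time, which is why the figure is mainly relevant for large batch workloads.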
Implementation Details
The model uses a hybrid FastConformer architecture with full attention and can process audio segments of up to 20 minutes in a single pass. It was trained on 36,000 hours of English speech: 27K hours from private datasets and 9K hours from public datasets, including LibriSpeech, Fisher Corpus, and VCTK.
- Uses 8x depthwise-separable convolutional downsampling
- Accepts 16 kHz mono-channel audio input
- Supports both TDT (Token-and-Duration Transducer) and CTC (Connectionist Temporal Classification) decoding
- Trained with the NeMo toolkit for 20,000 steps
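The downsampling factor determines how many frames the encoder's attention layers have to handle. The sketch below works through that arithmetic for a 20-minute input. It assumes a 10 ms feature hop before downsampling, which is a common choice but is not stated on this page, and it ignores padding, so the counts are approximate.

```python
SAMPLE_RATE_HZ = 16_000      # stated above: 16 kHz mono input
DOWNSAMPLING_FACTOR = 8      # stated above: 8x convolutional downsampling
FEATURE_HOP_MS = 10          # ASSUMPTION: 10 ms hop between acoustic feature frames

# Duration covered by one encoder frame after downsampling.
ENCODER_FRAME_MS = FEATURE_HOP_MS * DOWNSAMPLING_FACTOR               # 80 ms
SAMPLES_PER_ENCODER_FRAME = SAMPLE_RATE_HZ * ENCODER_FRAME_MS // 1000  # 1280 samples


def encoder_frames(audio_ms: int) -> int:
    """Approximate number of encoder frames for an input of audio_ms milliseconds."""
    return -(-audio_ms // ENCODER_FRAME_MS)  # integer ceiling division


def attention_pairs(frames: int) -> int:
    """Frame pairs scored by one full-attention head in one layer."""
    return frames * frames


twenty_minutes_ms = 20 * 60 * 1000
frames = encoder_frames(twenty_minutes_ms)
print(ENCODER_FRAME_MS)         # 80
print(frames)                   # 15000
print(attention_pairs(frames))  # 225000000
```

Under these assumptions a 20-minute segment becomes about 15,000 encoder frames. Because full attention scales quadratically with sequence length, the 8x reduction is a large part of what keeps long single-pass inputs tractable.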
Core Capabilities
- High-speed transcription with RTFx of ~5300 on A100
- Handles up to 20-minute audio segments
- Automatic punctuation and capitalization
- Strong accuracy across multiple domains (for example, 2.4% WER on LibriSpeech test-clean)
- BPE tokenizer with a vocabulary size of 1024
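The input constraints listed on this page (16 kHz, mono, at most 20 minutes per pass) can be checked before a file is sent to the model. The following is a minimal sketch using only Python's standard `wave` module. `check_wav` is a hypothetical helper, not part of NeMo, and it only handles uncompressed WAV files.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000   # Hz, as stated on this page
EXPECTED_CHANNELS = 1           # mono
MAX_SECONDS = 20 * 60           # single-pass limit stated on this page


def check_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file matches the stated input format."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames = wav.getnframes()

    problems = []
    if rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"file has {channels} channels, expected mono")
    duration = frames / rate
    if duration > MAX_SECONDS:
        problems.append(f"duration is {duration:.1f} s, longer than {MAX_SECONDS} s")
    return problems
```

A file that fails the check would need resampling, downmixing, or splitting into shorter segments before transcription.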
Frequently Asked Questions
Q: What makes this model unique?
The model combines fast inference (an RTFx of about 5300, i.e. roughly 5300x faster than real time), single-pass processing of audio up to 20 minutes long, and automatic punctuation and capitalization. This combination makes it well suited to production environments where both throughput and accuracy matter.
Q: What are the recommended use cases?
The model is ideal for applications requiring real-time or batch transcription of English speech, particularly when punctuation and capitalization are needed. It's especially suitable for transcribing long-form content like lectures, meetings, or interviews.
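For batch transcription, a minimal usage sketch with the NeMo toolkit might look like the following. It assumes NeMo is installed (for example via `pip install "nemo_toolkit[asr]"`) and that the checkpoint is published under the identifier `nvidia/parakeet-tdt_ctc-110m`; both are assumptions not confirmed by this page. `transcribe_files` is a hypothetical wrapper, and the exact return type of `transcribe` varies between NeMo versions.

```python
from typing import Optional

# ASSUMPTION: checkpoint identifier; verify against the official model listing.
MODEL_NAME = "nvidia/parakeet-tdt_ctc-110m"


def transcribe_files(paths: list[str], decoder_type: Optional[str] = None) -> list:
    """Transcribe 16 kHz mono audio files with NeMo.

    decoder_type=None keeps the checkpoint's default decoder;
    decoder_type="ctc" switches the hybrid model to its CTC head.
    """
    if not paths:
        return []  # nothing to transcribe, so skip loading the model

    # Imported lazily because NeMo is a heavy, optional dependency.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(model_name=MODEL_NAME)
    if decoder_type is not None:
        model.change_decoding_strategy(decoder_type=decoder_type)

    # Depending on the NeMo version, items are plain strings or hypothesis objects.
    return model.transcribe(paths)
```

Calling `transcribe_files(["meeting.wav"])` (with a hypothetical file name) would use the default decoder, while passing `decoder_type="ctc"` selects the CTC branch.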