parakeet-tdt_ctc-110m

Maintained By: nvidia

Parakeet TDT-CTC 110M

Property         Value
Parameter Count  114M
Model Type       ASR (Automatic Speech Recognition)
Architecture     Hybrid FastConformer TDT-CTC
License          CC-BY-4.0
Developer        NVIDIA NeMo and Suno.ai

What is parakeet-tdt_ctc-110m?

Parakeet TDT-CTC 110M is a speech recognition model developed jointly by the NVIDIA NeMo and Suno.ai teams. It transcribes English speech and produces punctuation and capitalization automatically. The model reaches an average inverse real-time factor (RTFx) of approximately 5300 on A100 GPUs, that is, roughly 5300 seconds of audio transcribed per second of compute, which places it among the fastest ASR models available.
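
Because RTFx is audio duration divided by processing time, the quoted figure can be turned into a rough compute-time estimate. The sketch below assumes the average of ~5300 holds for your workload, which it may not: actual throughput depends on batch size, audio length, and hardware. The function name is illustrative and not part of any library.

```python
def estimated_processing_seconds(audio_seconds: float, rtfx: float = 5300.0) -> float:
    """Estimate compute time from audio duration and an RTFx value.

    RTFx = audio duration / processing time,
    so processing time = audio duration / RTFx.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# One hour of audio at the quoted average RTFx of ~5300.
one_hour = estimated_processing_seconds(3600.0)
print(f"{one_hour:.2f} s")  # prints "0.68 s"
```

Under this assumption, an hour of audio costs well under a second of GPU time, so I/O and audio decoding are likely to dominate end-to-end latency.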

Implementation Details

The model implements a Hybrid FastConformer architecture with full attention, which allows it to process audio segments of up to 20 minutes in a single pass. It was trained on 36,000 hours of English speech: 27,000 hours of private data and 9,000 hours from public datasets such as LibriSpeech, Fisher Corpus, and VCTK.

  • Uses 8x depthwise-separable convolutional downsampling
  • Accepts 16 kHz mono-channel audio as input
  • Supports both TDT and CTC decoding
  • Trained with the NeMo toolkit for 20,000 steps
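
The input format and the two decoding options can be sketched as follows. This is a minimal example under stated assumptions, not the official usage: it assumes NeMo is installed, that the checkpoint is published under the ID `nvidia/parakeet-tdt_ctc-110m`, and that the hybrid model exposes `change_decoding_strategy(decoder_type="ctc")` as other NeMo hybrid models do. The helper names `is_16khz_mono_wav` and `transcribe` are hypothetical, and the return type of NeMo's `transcribe` differs between NeMo versions.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # Hz, per the input requirement above
EXPECTED_CHANNELS = 1          # mono


def is_16khz_mono_wav(path: str) -> bool:
    """Return True if the PCM WAV file at `path` is 16 kHz, single-channel."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_SAMPLE_RATE
            and wav.getnchannels() == EXPECTED_CHANNELS
        )


def transcribe(paths, decoder_type="tdt"):
    """Transcribe WAV files with the TDT (default) or CTC decoder.

    NeMo is imported lazily so the format check works without it.
    """
    if decoder_type not in ("tdt", "ctc"):
        raise ValueError(f"decoder_type must be 'tdt' or 'ctc', got {decoder_type!r}")

    # Reject files that do not match the expected input format.
    bad = [p for p in paths if not is_16khz_mono_wav(p)]
    if bad:
        raise ValueError(f"not 16 kHz mono WAV: {bad}")

    import nemo.collections.asr as nemo_asr  # heavy dependency

    model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt_ctc-110m"  # assumed model ID
    )
    if decoder_type == "ctc":
        # Assumption: the hybrid model decodes with the transducer (TDT)
        # head unless switched to the CTC head explicitly.
        model.change_decoding_strategy(decoder_type="ctc")
    return model.transcribe(list(paths))
```

A call such as `transcribe(["meeting.wav"], decoder_type="ctc")` (with a hypothetical file name) would validate the file first and only then download and load the model. Audio in other formats or sample rates would need resampling to 16 kHz mono beforehand.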

Core Capabilities

  • High-speed transcription with RTFx of ~5300 on A100
  • Handles up to 20-minute audio segments
  • Automatic punctuation and capitalization
  • Strong performance across multiple domains (WER: 2.4% on LibriSpeech test-clean)
  • BPE tokenizer with a vocabulary size of 1024
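
The WER figure quoted above is a word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A minimal sketch of that metric follows, assuming plain whitespace tokenization and no text normalization. Published benchmarks typically normalize case and punctuation before scoring, so this will not reproduce the 2.4% figure exactly. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.1%}")
```

Because punctuation and capitalization are part of this model's output, whether they are stripped before scoring changes the result noticeably.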

Frequently Asked Questions

Q: What makes this model unique?

The model combines fast inference (an RTFx of roughly 5300 on A100 GPUs), long-audio processing (up to 20 minutes in a single pass), and automatic punctuation and capitalization. That combination makes it well suited to production environments where both processing speed and accuracy matter.

Q: What are the recommended use cases?

The model is ideal for applications requiring real-time or batch transcription of English speech, particularly when punctuation and capitalization are needed. It's especially suitable for transcribing long-form content like lectures, meetings, or interviews.
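
Lectures and meetings often run longer than the 20-minute single-pass limit. One simple option, which the model card does not prescribe, is to split the recording into consecutive segments of at most 20 minutes and transcribe each in turn. The sketch below only computes segment boundaries in seconds; the names are illustrative. Fixed-length cuts can land in the middle of a word, so a real pipeline might cut at silences or use overlapping windows instead.

```python
MAX_SEGMENT_SECONDS = 20 * 60  # single-pass limit stated for this model


def plan_segments(total_seconds, max_seconds=MAX_SEGMENT_SECONDS):
    """Split [0, total_seconds] into consecutive (start, end) segments.

    Every segment is at most `max_seconds` long; the last one may be shorter.
    """
    total_seconds = float(total_seconds)
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")

    segments = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


# A 50-minute lecture becomes two full segments and one shorter one.
for start, end in plan_segments(50 * 60):
    print(f"{start / 60:.0f} min to {end / 60:.0f} min")
```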
