parakeet-ctc-1.1b

nvidia

Parakeet CTC 1.1B is a 1.1B-parameter automatic speech recognition (ASR) model for English transcription. It uses the FastConformer architecture with CTC loss and was trained on 64K hours of English speech.

Property	Value
Parameter Count	1.1 Billion
Model Type	Automatic Speech Recognition (ASR)
Architecture	FastConformer CTC
License	CC-BY-4.0
Input Format	16kHz mono-channel audio (WAV)
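The input format in the table can be checked before a file is sent to the model. The following is a minimal sketch using only Python's standard `wave` module; the function name `is_supported_wav` is hypothetical and not part of NeMo. Note that `wave` reads uncompressed PCM WAV files only and raises `wave.Error` for other encodings.

```python
import wave

# Input format stated in the table above: 16 kHz, single channel.
EXPECTED_SAMPLE_RATE = 16_000
EXPECTED_CHANNELS = 1


def is_supported_wav(path):
    """Return True if `path` is a 16 kHz mono PCM WAV file.

    Hypothetical helper for illustration; it only inspects the WAV header.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_SAMPLE_RATE
            and wav.getnchannels() == EXPECTED_CHANNELS
        )
```

Files that fail this check would need to be resampled or downmixed with an audio tool of your choice before transcription.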

What is parakeet-ctc-1.1b?

Parakeet CTC 1.1B is an automatic speech recognition model jointly developed by the NVIDIA NeMo and Suno.ai teams. It is the XXL version of the FastConformer CTC architecture and transcribes English speech to lowercase text. The model was trained on 64,000 hours of English speech drawn from both private and public datasets.

Implementation Details

The model uses the FastConformer architecture, an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. It is trained with CTC (Connectionist Temporal Classification) loss and can be used with the NVIDIA NeMo toolkit for both inference and fine-tuning.
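To make the two ideas above concrete, here is a small sketch in plain Python. It is an illustration, not NeMo's implementation. The first function estimates how many encoder frames the 8x downsampling leaves for a clip, assuming a 10 ms feature hop (an assumption; the hop size is not stated in this document). The second shows the standard greedy CTC decoding rule: merge consecutive repeated tokens, then drop the blank token. The blank index used in the example is arbitrary.

```python
from itertools import groupby

DOWNSAMPLING_FACTOR = 8  # stated in the text above
FEATURE_HOP_MS = 10      # assumption: 10 ms feature hop, not stated in this document


def encoder_frames(duration_seconds):
    """Approximate number of encoder output frames for a clip of this length."""
    feature_frames = int(duration_seconds * 1000) // FEATURE_HOP_MS
    return feature_frames // DOWNSAMPLING_FACTOR


def ctc_greedy_collapse(frame_ids, blank_id):
    """Collapse per-frame argmax token IDs into an output token sequence.

    Step 1: merge runs of identical consecutive IDs.
    Step 2: remove the blank ID.
    """
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


# Illustrative values only: 0 is used as the blank ID here.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
tokens = ctc_greedy_collapse(frames, blank_id=0)  # [3, 3, 5]
```

The blank between the two runs of `3` is what lets CTC emit the same token twice in a row; without it, the repeats would merge into one.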

Trained on a mix of datasets, including LibriSpeech, Fisher Corpus, and Switchboard-1
Uses SentencePiece Unigram tokenizer with 1024 vocabulary size
Reports low word error rate (WER) across various test sets (e.g., 1.83% on LibriSpeech test-clean)
Supports real-time transcription of audio files
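
For readers unfamiliar with the WER metric quoted above, the following sketch shows how it is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. This is a generic illustration, not the evaluation script used for the reported numbers, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because the model outputs lowercase text, references are typically lowercased and stripped of punctuation before scoring; that normalization step is omitted here.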

Core Capabilities

High-accuracy speech transcription to lowercase English text
Processing of 16kHz mono-channel audio files
Easy integration with NeMo toolkit for inference and fine-tuning
Robust performance across different speech domains
Support for batch processing of multiple audio files
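
A rough sketch of batch transcription through NeMo follows. It assumes `nemo_toolkit[asr]` is installed and that the checkpoint is published under the identifier `nvidia/parakeet-ctc-1.1b` and loadable through `EncDecCTCModelBPE`; neither detail is given in this document, so check both against the NeMo documentation. The helper names `collect_wav_paths` and `transcribe_files` are hypothetical.

```python
from pathlib import Path


def collect_wav_paths(directory):
    """Return the sorted paths of the .wav files directly inside `directory`."""
    return sorted(str(p) for p in Path(directory).glob("*.wav"))


def transcribe_files(paths, batch_size=4):
    """Transcribe 16 kHz mono WAV files and return one string per file.

    NeMo is imported lazily so the path helper above works without it.
    """
    import nemo.collections.asr as nemo_asr

    # Assumption: model identifier and class, see the note above.
    model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(
        model_name="nvidia/parakeet-ctc-1.1b"
    )
    results = model.transcribe(paths, batch_size=batch_size)
    # Depending on the NeMo version, entries are plain strings or
    # hypothesis objects exposing a `.text` attribute.
    return [getattr(result, "text", result) for result in results]
```

A call such as `transcribe_files(collect_wav_paths("recordings"))` would then process every WAV file in a (hypothetical) `recordings` directory; a larger `batch_size` trades memory for throughput.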

Frequently Asked Questions

Q: What makes this model unique?

The model's scale (1.1B parameters) and training data (64K hours) make it robust for general-purpose English speech recognition. It also integrates with NVIDIA's NeMo toolkit and reports low WER across several speech domains.

Q: What are the recommended use cases?

This model is ideal for applications requiring high-accuracy English speech transcription, including subtitle generation, voice command systems, and large-scale audio content processing. It's particularly suitable for scenarios where deployment through the NeMo toolkit is feasible and high accuracy is crucial.