parakeet-ctc-1.1b

Maintained By: nvidia

Parakeet CTC 1.1B

Property: Value
Parameter Count: 1.1 Billion
Model Type: Automatic Speech Recognition (ASR)
Architecture: FastConformer CTC
License: CC-BY-4.0
Input Format: 16kHz mono-channel audio (WAV)
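Because the model expects 16kHz mono-channel WAV input, it can help to check files before transcription. The following is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and not part of any toolkit.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems; an empty list means the file matches.

    Hypothetical helper for illustration: it only inspects the WAV header.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"file has {channels} channel(s), expected {expected_channels}")
    return problems
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.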

What is parakeet-ctc-1.1b?

Parakeet CTC 1.1B is an automatic speech recognition model jointly developed by the NVIDIA NeMo and Suno.ai teams. It is the XXL variant of the FastConformer CTC architecture and transcribes English speech to lowercase text. The model was trained on 64,000 hours of English speech drawn from both private and public datasets.

Implementation Details

The model uses the FastConformer architecture, an optimized version of the Conformer model featuring 8x depthwise-separable convolutional downsampling. It is trained with CTC (Connectionist Temporal Classification) loss and can be used with the NVIDIA NeMo toolkit for both inference and fine-tuning.
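To make the two ideas above concrete, here is a minimal sketch of what 8x downsampling means for sequence length and how a CTC output is collapsed into tokens. It is not the model's actual decoder. The 10 ms feature hop is an assumption (a common front-end setting, not stated on this page), real frame counts also depend on padding, and the token IDs and blank index in the usage note are made up for illustration.

```python
def encoder_frames(num_samples, sample_rate=16000, hop_ms=10, downsampling=8):
    """Approximate the number of encoder output frames for an audio clip.

    hop_ms is an assumed feature hop; padding effects are ignored.
    """
    samples_per_hop = sample_rate * hop_ms // 1000
    feature_frames = num_samples // samples_per_hop
    return feature_frames // downsampling


def ctc_greedy_decode(frame_ids, blank_id):
    """Collapse per-frame best token IDs: merge repeats, then drop blanks."""
    decoded = []
    previous = None
    for token_id in frame_ids:
        # A token is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            decoded.append(token_id)
        previous = token_id
    return decoded
```

Under these assumptions, `encoder_frames(16000 * 10)` gives 125 frames for ten seconds of audio, so each output frame covers roughly 80 ms. With blank ID 0, `ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0, 2], blank_id=0)` yields `[1, 2, 2]`: the blank between the last two runs of 2 is what lets the same token appear twice in a row.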

  • Trained on diverse datasets including LibriSpeech, Fisher Corpus, Switchboard-1, and more
  • Uses SentencePiece Unigram tokenizer with 1024 vocabulary size
  • Reports low word error rates (WER) across various test sets (e.g., 1.83% on LibriSpeech test-clean)
  • Supports real-time transcription of audio files
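For readers unfamiliar with the WER metric quoted above, the sketch below shows one common way to compute it: word-level edit distance divided by the number of reference words. This is an illustrative implementation, not the evaluation script behind the reported numbers. It lowercases both strings because the model emits lowercase text, and it does not strip punctuation, which a real evaluation would usually handle.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row of the table.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sat on mat" counts one deletion over six reference words, a WER of about 16.7%.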

Core Capabilities

  • High-accuracy speech transcription to lowercase English text
  • Processing of 16kHz mono-channel audio files
  • Easy integration with NeMo toolkit for inference and fine-tuning
  • Robust performance across different speech domains
  • Support for batch processing of multiple audio files
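The NeMo integration and batch processing listed above can be sketched as follows. The `from_pretrained` and `transcribe` calls follow the usage published on the model's Hugging Face card; the wrapper name `transcribe_files`, the default batch size, and the handling of return values are assumptions for illustration. Depending on the installed NeMo version, `transcribe` may return plain strings or hypothesis objects with a `text` attribute, so the sketch accepts both.

```python
def transcribe_files(paths, batch_size=4, model_name="nvidia/parakeet-ctc-1.1b"):
    """Transcribe a list of 16kHz mono WAV files and return lowercase text.

    Hypothetical wrapper: requires the NeMo toolkit with its ASR collection
    installed, and downloads the checkpoint on first use.
    """
    if not paths:
        return []

    # Imported lazily so this module can be loaded without NeMo present.
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name=model_name)
    results = asr_model.transcribe(list(paths), batch_size=batch_size)

    # Normalize the output: keep strings as-is, unwrap objects exposing `.text`.
    return [getattr(result, "text", result) for result in results]
```

Loading the model inside the function keeps the example short; a long-running service would more likely load it once and reuse it across calls.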

Frequently Asked Questions

Q: What makes this model unique?

The model's scale (1.1B parameters) and training data (64K hours) make it robust for general-purpose English speech recognition. It also integrates with NVIDIA's NeMo toolkit and reports low WER across various domains (for example, 1.83% on LibriSpeech test-clean).

Q: What are the recommended use cases?

This model is ideal for applications requiring high-accuracy English speech transcription, including subtitle generation, voice command systems, and large-scale audio content processing. It's particularly suitable for scenarios where deployment through the NeMo toolkit is feasible and high accuracy is crucial.
