parakeet-rnnt-1.1b

Maintained By: nvidia

Parakeet RNNT 1.1B

Property         Value
Parameter Count  1.1 billion
Model Type       Automatic Speech Recognition (ASR)
Architecture     FastConformer Transducer
License          CC-BY-4.0
Training Data    64K hours of English speech

What is parakeet-rnnt-1.1b?

Parakeet RNNT 1.1B is an automatic speech recognition (ASR) model developed jointly by the NVIDIA NeMo and Suno.ai teams. It is the XXL variant (about 1.1B parameters) of the FastConformer Transducer architecture and transcribes speech into lowercase English text. It was trained on 64,000 hours of English speech drawn from both private and public datasets.

Implementation Details

The model is built on FastConformer, an optimized version of the Conformer architecture that uses 8x depthwise-separable convolutional downsampling. It takes 16 kHz mono-channel audio as input and returns the transcription as text. Tokenization uses a SentencePiece Unigram tokenizer with a vocabulary size of 1024.
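
To make the 8x downsampling concrete, the sketch below estimates how many encoder frames a clip produces. It assumes a 10 ms feature hop before downsampling, a common ASR front-end setting that is not stated in this description. The function name and the ceiling-based rounding are illustrative only, and the real model's padding may shift the count by a frame or so.

```python
import math

SAMPLE_RATE_HZ = 16_000      # input sample rate stated in the description
DOWNSAMPLING_FACTOR = 8      # FastConformer downsampling stated in the description
FEATURE_HOP_SECONDS = 0.010  # ASSUMPTION: 10 ms feature hop, not given in the text


def estimate_encoder_frames(num_samples: int) -> int:
    """Rough count of encoder output frames for a mono 16 kHz clip (hypothetical helper)."""
    hop_samples = round(SAMPLE_RATE_HZ * FEATURE_HOP_SECONDS)  # 160 samples per feature frame
    feature_frames = math.ceil(num_samples / hop_samples)
    # The 8x downsampling leaves one encoder frame per 8 feature frames.
    return math.ceil(feature_frames / DOWNSAMPLING_FACTOR)


ten_seconds = 10 * SAMPLE_RATE_HZ
print(estimate_encoder_frames(ten_seconds))
```

Under these assumptions each encoder frame covers about 80 ms of audio, so the transducer decoder sees a much shorter sequence than the raw feature stream.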

  • Trained using the NVIDIA NeMo toolkit
  • Trained in a multitask setup with a Transducer decoder (RNNT) loss
  • Achieves low word error rates (WER) across various datasets (e.g., 1.46% on LibriSpeech test-clean)
  • Usable from both the Python API and the command line
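
For readers unfamiliar with the WER metric quoted above, here is a minimal sketch of how it is usually computed: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function name is hypothetical, and the only normalization applied is lowercasing (to match the model's lowercase output); real evaluation pipelines typically also normalize punctuation and numerals, so this is not the procedure behind the reported figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substituted word out of six reference words.
print(word_error_rate("The cat sat on the mat", "the cat sat on a mat"))
```

A WER of 1.46% therefore means roughly 1.5 word-level errors per 100 reference words.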

Core Capabilities

  • High-accuracy speech transcription to lowercase English text
  • Processes 16kHz mono-channel audio files
  • Efficient inference thanks to the 8x downsampling in the FastConformer encoder
  • Supports batch processing of multiple audio files
  • Integrates through the NeMo toolkit
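
The capabilities above can be sketched as a small batch-transcription helper. This is an illustration under stated assumptions, not an official recipe: the helper names are hypothetical, the checkpoint identifier `nvidia/parakeet-rnnt-1.1b` and the `EncDecRNNTBPEModel` class are assumed to be the right ones for this model, and running `transcribe_batch` requires the NeMo ASR package to be installed and will download the checkpoint. The input check uses only the standard library.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # 16 kHz, as stated in the description
EXPECTED_CHANNELS = 1          # mono


def is_supported_wav(path: str) -> bool:
    """Return True if the WAV file is 16 kHz mono (hypothetical helper)."""
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getframerate() == EXPECTED_SAMPLE_RATE
            and wav_file.getnchannels() == EXPECTED_CHANNELS
        )


def transcribe_batch(paths, batch_size=4):
    """Transcribe several files with NeMo (hypothetical helper; batch_size is illustrative)."""
    # Imported lazily so the input check above works without NeMo installed.
    import nemo.collections.asr as nemo_asr

    unsupported = [p for p in paths if not is_supported_wav(p)]
    if unsupported:
        raise ValueError(f"expected 16 kHz mono WAV files, got: {unsupported}")

    # ASSUMPTION: checkpoint identifier and model class for this model.
    asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
        model_name="nvidia/parakeet-rnnt-1.1b"
    )
    return asr_model.transcribe(paths, batch_size=batch_size)
```

The exact return type of `transcribe` differs between NeMo releases (plain strings in some, richer hypothesis objects in others), so inspect the result before post-processing it. Files that fail the check need to be resampled to 16 kHz mono first.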

Frequently Asked Questions

Q: What makes this model unique?

The model combines a large architecture (1.1B parameters) with extensive training data (64K hours drawn from private and public datasets). The 8x downsampling in its FastConformer encoder shortens the sequence the rest of the network has to process, which helps it stay efficient at this scale.

Q: What are the recommended use cases?

The model suits large-scale English speech recognition tasks that require high accuracy. It can be used in both research and production environments, especially when deployed through the NVIDIA NeMo toolkit or the NVIDIA Riva framework.
