Canary-1B

Property	Value
Parameters	1 Billion
License	CC-BY-NC-4.0
Architecture	FastConformer-Transformer
Paper	Fast Conformer Paper

What is Canary-1B?

Canary-1B is NVIDIA's multilingual speech model that combines automatic speech recognition (ASR) and speech-to-text translation in a single model. Built with the NeMo toolkit, it supports four languages (English, German, French, Spanish), with reported WER as low as 3.99% for ASR and BLEU scores up to 40.76 for translation.

Implementation Details

The model uses an encoder-decoder architecture with a FastConformer encoder and a Transformer decoder, each with 24 layers. It processes single-channel (mono) audio sampled at 16 kHz and uses a concatenated tokenizer built from one SentencePiece tokenizer per language. It was trained on 85,000 hours of speech data drawn from public, Suno-collected, and in-house datasets.
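Because the model expects mono audio at 16 kHz, it can help to validate files before transcription. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav_format` and the returned dictionary keys are illustrative choices, not part of NeMo or the model's API.

```python
import wave

EXPECTED_RATE_HZ = 16_000   # sample rate stated in the model description
EXPECTED_CHANNELS = 1       # single-channel (mono) audio


def check_wav_format(path):
    """Return (ok, details) for a WAV file.

    ok is True only when the file is mono and sampled at 16 kHz.
    This helper is hypothetical; it only inspects the WAV header.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames = wav.getnframes()

    duration_s = frames / rate if rate else 0.0
    ok = channels == EXPECTED_CHANNELS and rate == EXPECTED_RATE_HZ
    details = {
        "channels": channels,
        "sample_rate": rate,
        "duration_s": duration_s,
    }
    return ok, details
```

Files that fail the check would need to be downmixed or resampled with an audio tool of your choice before being passed to the model.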

Multi-task capability supporting both ASR and translation
Punctuation and capitalization control
Beam search decoding with configurable parameters
Dynamic batching support
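
The multi-task and punctuation options listed above can be pictured as a per-utterance request record. The sketch below builds such records as JSON lines. The field names (`audio_filepath`, `duration`, `taskname`, `source_lang`, `target_lang`, `pnc`), the task labels, and the two-letter language codes are assumptions modeled on the JSON-lines manifest style common in NeMo; check them against the model's own documentation before relying on them.

```python
import json

# Assumed two-letter codes for the four supported languages.
SUPPORTED_LANGS = {"en", "de", "fr", "es"}


def make_manifest_entry(audio_path, duration_s, source_lang, target_lang, pnc=True):
    """Build one request record (hypothetical helper).

    The task is ASR when source and target languages match,
    and speech-to-text translation otherwise.
    """
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language: {lang!r}")

    # Task labels are assumed, not confirmed by this document.
    task = "asr" if source_lang == target_lang else "s2t_translation"
    return {
        "audio_filepath": audio_path,
        "duration": duration_s,
        "taskname": task,
        "source_lang": source_lang,
        "target_lang": target_lang,
        "pnc": "yes" if pnc else "no",  # punctuation and capitalization toggle
    }


def to_jsonl(entries):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(entry) for entry in entries)
```

For example, `make_manifest_entry("sample.wav", 4.2, "en", "de")` would describe an English-to-German translation request with punctuation and capitalization enabled, while passing `"en"` for both languages would describe plain English transcription.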

Core Capabilities

ASR in English, German, French, and Spanish with WER as low as 3.99%
Speech-to-text translation from English to German, French, and Spanish, and from those languages to English
BLEU scores up to 40.76 for translation tasks
Support for punctuation and capitalization options
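
For readers unfamiliar with the WER figure quoted above: word error rate is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes it with a standard edit-distance recurrence. It is a generic illustration, not the evaluation script used for the reported numbers, and it applies no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance.

    WER = (substitutions + deletions + insertions) / reference word count.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref)
```

With the reference "the cat sat on the mat" and the hypothesis "the cat sat on mat", one word is deleted out of six, giving a WER of 1/6, or about 16.7%. A WER of 3.99% corresponds to roughly four word errors per hundred reference words.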

Frequently Asked Questions

Q: What makes this model unique?

Canary-1B stands out for its multi-task capabilities, handling both ASR and translation in multiple languages with a single model, while achieving competitive performance across all tasks.

Q: What are the recommended use cases?

The model is ideal for multilingual speech transcription, cross-language speech translation, and applications requiring high-quality speech understanding in supported languages. It's particularly useful for scenarios needing both ASR and translation capabilities.
