Canary-1B
| Property | Value |
|---|---|
| Parameters | 1 billion |
| License | CC-BY-NC-4.0 |
| Architecture | FastConformer-Transformer |
| Paper | Fast Conformer Paper |
What is Canary-1B?
Canary-1B is NVIDIA's multilingual, multi-task speech model, combining automatic speech recognition (ASR) and speech-to-text translation in a single network. Built with the NeMo toolkit, it supports four languages (English, German, French, and Spanish) and reports WER as low as 3.99% for ASR and BLEU scores up to 40.76 for translation.
Implementation Details
The model pairs a FastConformer encoder with a Transformer decoder, each with 24 layers. It accepts single-channel (mono) audio sampled at 16 kHz and uses a concatenated tokenizer built from per-language SentencePiece tokenizers. It was trained on 85,000 hours of speech data drawn from public, Suno-collected, and in-house datasets.
- Multi-task capability supporting both ASR and translation
- Punctuation and capitalization control
- Beam search decoding with configurable parameters
- Dynamic batching support
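The input requirement and task options above can be sketched as a small pre-flight helper. This is a minimal illustration using only the Python standard library, not NeMo code: `describe_wav` and `build_manifest_entry` are hypothetical helpers, and the record field names (`taskname`, `source_lang`, `target_lang`, `pnc`), the task label `s2t_translation`, and the two-letter language codes are assumptions that should be checked against the official model card before use.

```python
import json
import wave

# Languages listed in this document; the two-letter codes are an assumption.
SUPPORTED_LANGS = {"en", "de", "fr", "es"}
EXPECTED_SAMPLE_RATE = 16_000  # Hz, per the input requirement above


def describe_wav(path):
    """Return (channels, sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, rate, duration


def build_manifest_entry(path, source_lang, target_lang, pnc=True):
    """Build one task record (hypothetical helper, not part of NeMo).

    The task is ASR when source and target languages match, and
    speech-to-text translation otherwise.
    """
    if source_lang not in SUPPORTED_LANGS or target_lang not in SUPPORTED_LANGS:
        raise ValueError("unsupported language code")
    channels, rate, duration = describe_wav(path)
    if channels != 1 or rate != EXPECTED_SAMPLE_RATE:
        raise ValueError(
            f"expected mono {EXPECTED_SAMPLE_RATE} Hz audio, "
            f"got {channels} channel(s) at {rate} Hz"
        )
    # Field names and task labels below are assumed, not confirmed by this document.
    return {
        "audio_filepath": path,
        "duration": round(duration, 3),
        "taskname": "asr" if source_lang == target_lang else "s2t_translation",
        "source_lang": source_lang,
        "target_lang": target_lang,
        "pnc": "yes" if pnc else "no",  # punctuation and capitalization control
    }


def to_manifest_line(entry):
    """Serialize one record as a single JSON line."""
    return json.dumps(entry)
```

Rejecting files that are not mono 16 kHz before inference avoids silent quality loss from a mismatched sample rate; resampling itself is left to an audio tool of your choice.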
Core Capabilities
- ASR in English, German, French, and Spanish with WER as low as 3.99%
- Speech-to-text translation from English to German, French, and Spanish, and from each of those languages to English
- BLEU scores up to 40.76 for translation tasks
- Support for punctuation and capitalization options
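To make the WER figure above concrete, here is a minimal sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is illustrative, and the sketch splits on whitespace only; benchmark evaluations usually apply text normalization first, so this is not the scoring pipeline behind the reported numbers.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Read this way, a WER of 3.99% corresponds to roughly four word errors per 100 reference words.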
Frequently Asked Questions
Q: What makes this model unique?
Canary-1B stands out for its multi-task capabilities, handling both ASR and translation in multiple languages with a single model, while achieving competitive performance across all tasks.
Q: What are the recommended use cases?
The model is ideal for multilingual speech transcription, cross-language speech translation, and applications requiring high-quality speech understanding in supported languages. It's particularly useful for scenarios needing both ASR and translation capabilities.