canary-1b

Maintained by: NVIDIA

Canary-1B

Property      Value
Parameters    1 billion
License       CC-BY-NC-4.0
Architecture  FastConformer-Transformer
Paper         Fast Conformer Paper

What is Canary-1B?

Canary-1B is NVIDIA's state-of-the-art multilingual speech model that combines automatic speech recognition (ASR) and speech-to-text translation in a single model. Built with the NeMo toolkit, it supports four languages (English, German, French, Spanish), with reported results of WER as low as 3.99% for ASR and BLEU scores up to 40.76 for translation.

Implementation Details

The model uses an encoder-decoder architecture: a FastConformer encoder and a Transformer decoder, with 24 layers each. It takes single-channel (mono) audio sampled at 16 kHz as input and uses concatenated SentencePiece tokenizers, one per language. It was trained on 85,000 hours of speech data drawn from public, Suno-collected, and in-house datasets.
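Because the expected input is single-channel audio at 16 kHz, it can help to check files before transcription. The following is a minimal sketch using only Python's standard `wave` module. The helper names (`describe_wav`, `is_model_ready`, `write_silent_wav`) are hypothetical and not part of NeMo, and the sketch assumes uncompressed PCM WAV input.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # sample rate stated in the text
EXPECTED_CHANNELS = 1      # single-channel (mono)


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, rate, duration


def is_model_ready(path):
    """True if the file is already mono 16 kHz; otherwise it needs conversion."""
    channels, rate, _ = describe_wav(path)
    return channels == EXPECTED_CHANNELS and rate == EXPECTED_RATE_HZ


def write_silent_wav(path, rate_hz, channels, seconds):
    """Hypothetical helper for trying the check: writes 16-bit PCM silence."""
    n_frames = int(rate_hz * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * channels * n_frames)
```

Files that fail the check (for example stereo 44.1 kHz recordings) would need to be down-mixed and resampled with an audio tool of your choice before being passed to the model.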

  • Multi-task capability supporting both ASR and translation
  • Punctuation and capitalization control
  • Beam search decoding with configurable parameters
  • Dynamic batching support
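One way to picture the multi-task and punctuation/capitalization controls is as a per-utterance request record: the same model performs ASR when the output language matches the audio language and translation when it differs. The sketch below builds such records as JSON lines. The field names (`audio_filepath`, `taskname`, `source_lang`, `target_lang`, `pnc`) and the task labels are illustrative assumptions, not a confirmed NeMo schema; consult the official model card for the exact input format.

```python
import json

SUPPORTED_LANGS = {"en", "de", "fr", "es"}  # the four languages named in the text


def make_entry(audio_filepath, source_lang, target_lang, pnc=True):
    """Build one request record (hypothetical schema, for illustration only)."""
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language: {lang!r}")
    # Same language in and out means transcription; otherwise translation.
    task = "asr" if source_lang == target_lang else "translation"
    return {
        "audio_filepath": audio_filepath,
        "taskname": task,
        "source_lang": source_lang,
        "target_lang": target_lang,
        "pnc": "yes" if pnc else "no",  # punctuation and capitalization on/off
    }


def to_json_lines(entries):
    """Serialize records one JSON object per line."""
    return "\n".join(json.dumps(entry) for entry in entries) + "\n"


if __name__ == "__main__":
    entries = [
        make_entry("clip_en.wav", "en", "en"),             # English ASR
        make_entry("clip_de.wav", "de", "en", pnc=False),  # German to English
    ]
    print(to_json_lines(entries), end="")
```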

Core Capabilities

  • ASR in English, German, French, and Spanish with WER as low as 3.99%
  • Speech-to-text translation between supported languages
  • BLEU scores up to 40.76 for translation tasks
  • Support for punctuation and capitalization options
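For context on the ASR figure: word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. A WER of 3.99% therefore corresponds to roughly four word errors per 100 reference words. The sketch below computes WER with a standard edit-distance recurrence. It is a generic illustration, not the evaluation script behind the reported numbers, and it splits on whitespace without any text normalization.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sat on mat" gives one deletion over six reference words, a WER of about 16.7%.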

Frequently Asked Questions

Q: What makes this model unique?

Canary-1B stands out for its multi-task capabilities, handling both ASR and translation in multiple languages with a single model, while achieving competitive performance across all tasks.

Q: What are the recommended use cases?

The model is ideal for multilingual speech transcription, cross-language speech translation, and applications requiring high-quality speech understanding in supported languages. It's particularly useful for scenarios needing both ASR and translation capabilities.
