canary-1b-flash

Maintained by: NVIDIA

Canary-1B-Flash

Property             Value
Parameters           883 million
Architecture         FastConformer encoder + Transformer decoder
License              CC-BY-4.0
Training Data        85,000 hours of multilingual speech
Supported Languages  English, German, French, Spanish

What is canary-1b-flash?

Canary-1B-Flash is NVIDIA's multilingual speech model for automatic speech recognition (ASR) and speech translation. It reaches inference speeds above 1000 RTFx (inverse real-time factor) and covers four languages (English, German, French, and Spanish) while maintaining high accuracy. It also produces punctuated output and can generate word-level timestamps.
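RTFx is audio duration divided by processing time, so the speed figure above can be turned into a rough wall-clock estimate. The following is a minimal sketch of that arithmetic; the function names are illustrative, and 1000 is used as a round number since actual throughput depends on the GPU and batch size.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Seconds of compute needed to process the audio at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


one_hour = 3600.0
# At 1000 RTFx, one hour of audio needs about 3.6 seconds of compute.
print(processing_time(one_hour, 1000.0))
```

An RTFx of 1 means real-time speed; values above 1 are faster than real time.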

Implementation Details

Built on NVIDIA's NeMo framework, the model uses a 32-layer FastConformer encoder and a 4-layer Transformer decoder. It relies on a concatenated tokenizer for multilingual text and accepts audio input in formats including .wav and .flac.
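A hedged sketch of how the model might be loaded and run through NeMo is shown here. It assumes the checkpoint is published as `nvidia/canary-1b-flash`, that NeMo exposes it through `EncDecMultiTaskModel`, and that `transcribe()` accepts the `batch_size`, `source_lang`, and `target_lang` keywords; none of this is stated in the text above, so check it against the NeMo version you have installed. The helper names `filter_supported` and `transcribe_files` are hypothetical.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".wav", ".flac"}  # input formats named in the text


def filter_supported(paths):
    """Keep only files whose extension is one of the named input formats."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_SUFFIXES]


def transcribe_files(paths, source_lang="en", target_lang="en", batch_size=16):
    """Transcribe audio files, or translate them if the two languages differ.

    Assumption: the class name, checkpoint name, and keyword arguments below
    match your NeMo release; they are not confirmed by this document.
    """
    # Imported lazily so filter_supported() works without NeMo installed.
    from nemo.collections.asr.models import EncDecMultiTaskModel

    model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-flash")
    outputs = model.transcribe(
        filter_supported(paths),
        batch_size=batch_size,
        source_lang=source_lang,
        target_lang=target_lang,
    )
    # Some NeMo releases return hypothesis objects with a .text attribute,
    # others return plain strings; handle both.
    return [getattr(o, "text", o) for o in outputs]
```

Calling `transcribe_files(["talk.wav"], source_lang="en", target_lang="de")` would, under these assumptions, request English-to-German translation, while equal source and target languages request plain ASR.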

  • Trained on 85K hours of diverse speech data
  • Achieves low word error rates (WER), e.g., 1.48% on LibriSpeech Clean
  • Supports batch processing and long-form audio handling
  • Handles noisy audio robustly
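For readers unfamiliar with the WER metric quoted in the list above, here is a minimal sketch of how it is commonly computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. This is a generic illustration rather than the evaluation script behind the reported score; published benchmarks usually also normalize text (casing, punctuation) first, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


ref = "the quick brown fox jumps"
hyp = "the quick brown box jumps"
# One substitution in five reference words.
print(f"{100 * word_error_rate(ref, hyp):.2f}%")  # prints 20.00%
```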

Core Capabilities

  • Multi-language ASR with punctuation and capitalization
  • Cross-language translation (e.g., English to German/French/Spanish)
  • Word and segment-level timestamp generation
  • Long-form audio processing with automatic chunking
  • High-speed inference across various NVIDIA GPUs
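Long-form chunking and timestamps interact: when a recording is split into windows, timestamps produced inside each window have to be shifted back to recording time. The sketch below illustrates that bookkeeping with plain fixed-length windows. It is not the model's actual chunking logic, and the 40-second default and all names (`Chunk`, `plan_chunks`, `to_global`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    start: float  # seconds from the beginning of the recording
    end: float


def plan_chunks(total_seconds: float, chunk_seconds: float = 40.0):
    """Split [0, total_seconds) into consecutive windows of at most chunk_seconds."""
    if total_seconds < 0 or chunk_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and chunk_seconds > 0")
    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        chunks.append(Chunk(start, end))
        start = end
    return chunks


def to_global(chunk: Chunk, local_words):
    """Shift (word, start, end) timestamps from chunk-local to recording time."""
    return [(w, s + chunk.start, e + chunk.start) for w, s, e in local_words]


chunks = plan_chunks(100.0)
print(len(chunks))  # 3 windows: 0-40, 40-80, 80-100
print(to_global(chunks[1], [("hello", 0.5, 1.25)]))
```

A production pipeline would typically also overlap adjacent windows or cut at silences so that words are not split at chunk boundaries.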

Frequently Asked Questions

Q: What makes this model unique?

The combination of high inference speed (1000+ RTFx), multilingual support, and features such as timestamps and punctuation makes it versatile. Its efficient architecture and robust performance across different audio conditions set it apart from other speech models.

Q: What are the recommended use cases?

The model excels in enterprise-scale speech transcription, multilingual content processing, and cross-language translation tasks. It's particularly suitable for applications requiring real-time processing or handling large volumes of audio content across multiple languages.
