canary-180m-flash

nvidia

NVIDIA's 182M-parameter multilingual speech model supporting ASR and translation across English, German, French, and Spanish, with high throughput (1200+ RTFx) and timestamp capabilities

Property         Value
Parameter Count  182 Million
License          CC-BY-4.0
Architecture     FastConformer Encoder + Transformer Decoder
Developer        NVIDIA
Training Data    85,000 hours of multilingual speech

What is canary-180m-flash?

Canary-180m-flash is NVIDIA's multilingual speech model for automatic speech recognition (ASR) and speech translation. With 182 million parameters, it processes audio at more than 1200 times real-time speed (an RTFx above 1200) and supports four languages: English, German, French, and Spanish. Alongside transcription and translation, it provides word-level timestamps and predicts punctuation.

Implementation Details

The model pairs a FastConformer encoder with a Transformer decoder, using 17 encoder layers and 4 decoder layers. It employs a concatenated tokenizer built from individual SentencePiece tokenizers, one per supported language, which lets a single model handle all four languages. The model operates on 16 kHz mono-channel audio and accepts input formats including .wav and .flac files.

  • Trained on 85,000 hours of diverse speech data
  • Supports bidirectional translation between English and German/French/Spanish
  • Features automatic punctuation and capitalization
  • Provides word-level and segment-level timestamp capabilities
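
The 16 kHz mono input requirement described in this section can be checked before audio is sent to the model. The following is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and not part of any NVIDIA tooling, and the sketch covers .wav files only (.flac would need a third-party reader).

```python
import wave

# Input format stated in the text above: 16 kHz, mono.
EXPECTED_SAMPLE_RATE = 16_000
EXPECTED_CHANNELS = 1


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file matches.

    Hypothetical helper for illustration only.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz"
        )
    if channels != EXPECTED_CHANNELS:
        problems.append(
            f"found {channels} channels, expected {EXPECTED_CHANNELS}"
        )
    return problems
```

A file that fails this check would need to be resampled or downmixed first; how that is done is outside the scope of this sketch.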

Core Capabilities

  • High-speed ASR with 1200+ RTFx on modern GPUs
  • Multilingual speech recognition with state-of-the-art accuracy
  • Speech-to-text translation across supported language pairs
  • Timestamp generation for precise word alignment
  • Support for long-form audio through chunked processing
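
To illustrate the idea behind chunked processing of long-form audio, here is a rough sketch that splits a recording's duration into consecutive fixed-length spans. The function name `chunk_spans` and the 40-second default are assumptions chosen for illustration; the page does not specify a chunk length, and real chunked inference may also need overlap handling and transcript merging, which this sketch omits.

```python
def chunk_spans(total_seconds, chunk_seconds=40.0):
    """Split [0, total_seconds) into consecutive (start, end) spans in seconds.

    Hypothetical helper: the 40 s default is an assumed value, not one
    taken from the model documentation.
    """
    total_seconds = float(total_seconds)
    chunk_seconds = float(chunk_seconds)
    if total_seconds < 0 or chunk_seconds <= 0:
        raise ValueError("durations must be non-negative and chunk length positive")

    spans = []
    start = 0.0
    while start < total_seconds:
        # The final chunk is shortened so it never runs past the audio end.
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans
```

Each span could then be cut from the source audio and transcribed separately, with the per-chunk start time added back to any timestamps the model returns.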

Frequently Asked Questions

Q: What makes this model unique?

The model combines strong accuracy with high speed, reaching an inverse real-time factor (RTFx) above 1200. Its ability to handle multiple languages and tasks within a relatively compact 182M-parameter architecture makes it efficient and versatile.
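
To make the speed figure concrete: RTFx is the audio duration divided by the time taken to process it, so higher is faster. The short sketch below shows the arithmetic; the function names are illustrative, and the one-hour example is a back-of-the-envelope estimate rather than a measured benchmark.

```python
def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds, rtfx_value):
    """Estimate compute time for a given audio duration at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("RTFx must be positive")
    return audio_seconds / rtfx_value


# One hour of audio at an RTFx of 1200 works out to about 3 seconds of compute.
one_hour = 3600
print(estimated_processing_seconds(one_hour, 1200))
```

Actual throughput depends on hardware, batch size, and audio length, so the 1200+ figure should be read as a reported benchmark value rather than a guarantee.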

Q: What are the recommended use cases?

The model is ideal for applications requiring fast, accurate speech transcription and translation, including media subtitling, content localization, and real-time speech processing. It's particularly suited for scenarios requiring timestamp information or handling multiple languages within the supported set.
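
For the subtitling use case, word-level timestamps can be grouped into subtitle cues. The sketch below converts a list of timed words into SubRip (SRT) text. The input shape (dictionaries with `word`, `start`, and `end` keys, times in seconds) is an assumption made for this example and is not the model's documented output format; the function names and the seven-words-per-cue limit are likewise illustrative.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group timed words into numbered SRT cues of at most max_words each.

    `words` is assumed to be a list of dicts with "word", "start", "end".
    """
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        text = " ".join(w["word"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(cues)
```

A production subtitler would typically also split cues on punctuation and pauses; fixed-size grouping is used here only to keep the example short.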
