stt_zh_conformer_transducer_large

nvidia

A large Mandarin speech recognition model (120M parameters) built on the Conformer-Transducer architecture. It reports 5.3-5.7% WER on AISHELL-2 and expects 16 kHz mono audio.

Property            Value
Model Size          120M parameters
Input Format        16 kHz mono audio (WAV)
Vocabulary Size     5026 characters
License             CC-BY-4.0
Training Dataset    AISHELL-2
Best WER            5.3% (Test iOS)

What is stt_zh_conformer_transducer_large?

This is NVIDIA's large speech recognition model for Mandarin Chinese transcription. It is built on the Conformer-Transducer architecture, which pairs a Conformer encoder (a Transformer augmented with convolution layers) with transducer-based decoding. On AISHELL-2 it reports a WER of 5.3-5.7%, depending on the test set.

Implementation Details

The model is trained and run with the NeMo toolkit and uses character-based tokenization with a vocabulary of 5026 characters. It takes 16 kHz mono audio as input and outputs transcriptions directly as Chinese characters.
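Since the model expects 16 kHz mono input, it can help to check a file's header before transcription. The following is a minimal sketch using only Python's standard `wave` module. The names `wav_format_problems` and `silent_wav` are hypothetical helpers written for this illustration and are not part of NeMo.

```python
import io
import wave

EXPECTED_RATE_HZ = 16000   # sample rate stated for this model
EXPECTED_CHANNELS = 1      # mono


def wav_format_problems(source):
    """Return a list of mismatches between a WAV header and 16 kHz mono.

    `source` is a path string or a binary file-like object.
    An empty list means the header matches the expected format.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()

    problems = []
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"found {channels} channels, expected mono")
    return problems


def silent_wav(rate, channels, seconds=0.1):
    """Build a short in-memory WAV of silence, used only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit PCM samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(wav_format_problems(silent_wav(16000, 1)))
    print(wav_format_problems(silent_wav(44100, 2)))
```

A file that fails the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.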

  • Trained on the AISHELL-2 dataset
  • Reports 5.3-5.7% Word Error Rate, depending on the test condition
  • Trained with the transducer loss and decoded autoregressively
  • Integrates through the NeMo toolkit
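To make the error-rate figures above concrete, here is a minimal sketch of how such a rate is typically computed: the edit distance between reference and hypothesis, divided by the reference length. Scoring one token per character is an assumption made here because the model uses a character vocabulary. The source does not describe its scoring script, and `edit_distance` and `error_rate` are hypothetical names for this illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis):
    """Edit distance over reference length, scored per character.

    Assumption: spaces are ignored and every remaining character is one token.
    """
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of six in the reference.
    print(error_rate("今天天气很好", "今天天汽很好"))
```

Under this convention, a rate of 5.3% corresponds to roughly one character error for every 19 reference characters.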

Core Capabilities

  • High-accuracy Mandarin speech transcription
  • Batch processing of multiple audio files
  • Simple Python API integration
  • Evaluated on audio from different recording environments (iOS, Android, Mic)
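The batch processing and Python API items above might look like the following sketch. It assumes NeMo is installed and that the checkpoint can be loaded by the name used on this page through `ASRModel.from_pretrained`. The helpers `select_wav_paths` and `transcribe_files` are hypothetical names for this illustration, and the batch size of 4 is an arbitrary choice.

```python
from pathlib import Path

# Assumption: the checkpoint is published under the name shown on this page.
MODEL_NAME = "stt_zh_conformer_transducer_large"


def select_wav_paths(paths):
    """Keep only paths ending in .wav (case-insensitive), sorted by name."""
    return sorted(p for p in paths if Path(p).suffix.lower() == ".wav")


def transcribe_files(paths, batch_size=4):
    """Transcribe several WAV files in batches with NeMo.

    Returns the selected file list and NeMo's transcription result.
    """
    # Imported lazily so that select_wav_paths works without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(model_name=MODEL_NAME)
    files = select_wav_paths(paths)
    # The element type of the result (plain strings or hypothesis objects)
    # differs between NeMo releases, so it is returned unmodified.
    return files, model.transcribe(files, batch_size=batch_size)


if __name__ == "__main__":
    # Hypothetical file names; replace them with real 16 kHz mono recordings.
    files, results = transcribe_files(["meeting_01.wav", "meeting_02.wav"])
    for path, result in zip(files, results):
        print(path, result)
```

The first call downloads the checkpoint, so loading the model once and reusing it across batches is usually preferable to reloading it per file.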

Frequently Asked Questions

Q: What makes this model unique?

This model pairs the Conformer encoder with transducer-based decoding and is trained specifically for Mandarin Chinese. With 120M parameters and training on AISHELL-2, it reports 5.3-5.7% WER across the dataset's recording conditions.

Q: What are the recommended use cases?

The model suits Mandarin transcription in applications that need high accuracy, such as automated transcription services, voice assistants, and speech analytics platforms. Accuracy may drop on technical terms or heavily accented speech that is not represented in the training data.
