librispeech_100_e_branchformer

Maintained by: pyf98

LibriSpeech-100 E-Branchformer ASR Model

Property   | Value
-----------|---------------------
Author     | pyf98
License    | CC-BY-4.0
Paper      | E-Branchformer Paper
Framework  | ESPnet

What is librispeech_100_e_branchformer?

This is an automatic speech recognition (ASR) model implementing the E-Branchformer architecture, trained on the 100-hour train-clean-100 subset of LibriSpeech. E-Branchformer runs a self-attention branch and a convolutional gating MLP (cgMLP) branch in parallel, capturing global and local context in the speech signal respectively.

Implementation Details

The model stacks 12 E-Branchformer encoder blocks, each pairing a self-attention branch with a cgMLP branch. Key specifications include an output size of 256, 4 attention heads, and 1024 linear units. Decoding is joint CTC/attention with a CTC weight of 0.3. A minimal sketch of one encoder block follows the list below.

  • Encoder: E-Branchformer with 12 blocks
  • Decoder: Transformer with 6 blocks
  • Frontend: STFT with a 512-point FFT and a 400-sample (25 ms at 16 kHz) window
  • SpecAugment: Time warping and masking enabled
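
To make the block structure concrete, here is a minimal PyTorch sketch of one E-Branchformer encoder block using the dimensions listed above (output size 256, 4 attention heads, 1024 linear units). It illustrates the parallel-branch idea only and is not ESPnet's actual implementation: the real encoder also includes macaron-style feed-forward layers and positional encodings, which are omitted here, and the kernel sizes are assumptions.

```python
# Minimal sketch of one E-Branchformer encoder block (illustrative only;
# not ESPnet's implementation). Macaron feed-forward layers and positional
# encodings are omitted; kernel sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CgMLP(nn.Module):
    """Convolutional gating MLP branch: captures local context."""

    def __init__(self, d_model: int = 256, d_hidden: int = 1024, kernel: int = 31):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        # Depthwise conv over time produces a gate from half of the features.
        self.dw_conv = nn.Conv1d(d_hidden // 2, d_hidden // 2, kernel,
                                 padding=kernel // 2, groups=d_hidden // 2)
        self.down = nn.Linear(d_hidden // 2, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        a, b = F.gelu(self.up(x)).chunk(2, dim=-1)
        gate = self.dw_conv(b.transpose(1, 2)).transpose(1, 2)
        return self.down(a * gate)


class EBranchformerBlock(nn.Module):
    """Parallel attention (global) and cgMLP (local) branches, merged."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, d_hidden: int = 1024):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(d_model)
        self.cgmlp = CgMLP(d_model, d_hidden)
        # Enhanced merge: depthwise conv over the concatenated branch
        # outputs, then a linear projection back to the model dimension.
        self.merge_conv = nn.Conv1d(2 * d_model, 2 * d_model, 31,
                                    padding=15, groups=2 * d_model)
        self.merge_proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        g = self.attn_norm(x)
        g, _ = self.attn(g, g, g)            # global branch: self-attention
        l = self.cgmlp(self.mlp_norm(x))     # local branch: cgMLP
        cat = torch.cat([g, l], dim=-1)
        cat = cat + self.merge_conv(cat.transpose(1, 2)).transpose(1, 2)
        return x + self.merge_proj(cat)      # residual connection


x = torch.randn(2, 100, 256)                 # (batch, frames, features)
print(EBranchformerBlock()(x).shape)         # torch.Size([2, 100, 256])
```

The full encoder stacks 12 such blocks; the depthwise-convolution merge is what distinguishes E-Branchformer from the original Branchformer's simpler merging.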

Core Capabilities

  • 94.4% word accuracy on LibriSpeech test-clean (roughly 5.6% WER; a sketch of how such scores are computed follows this list)
  • 85.0% word accuracy on the more challenging test-other set (roughly 15.0% WER)
  • Effective handling of both clean and noisy speech
  • Robust performance across different speaking styles
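
The accuracy figures above are word-level accuracy, i.e., 100% minus the word error rate (WER) that ASR benchmarks conventionally report. As a quick illustration of how such a score is computed, here is a hedged sketch using the third-party jiwer package (an assumption for illustration; this model card does not mention it):

```python
# Hedged illustration: word error rate with the third-party `jiwer`
# package (pip install jiwer). The transcripts below are made up.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

wer = jiwer.wer(reference, hypothesis)  # two substitutions out of nine words
print(f"WER: {wer:.1%}, word accuracy: {1 - wer:.1%}")
```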

Frequently Asked Questions

Q: What makes this model unique?

The E-Branchformer architecture runs self-attention and cgMLP branches in parallel and combines them with an enhanced merge module: the branch outputs are concatenated, refined by a depthwise convolution, and projected back to the model dimension. The encoder block sketch under Implementation Details above illustrates this merging step.

Q: What are the recommended use cases?

This model is ideal for English speech recognition tasks, particularly in scenarios requiring high accuracy on clean speech while maintaining reasonable performance on more challenging audio conditions.
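
For a quick way to try the model, here is a hedged usage sketch with ESPnet's Speech2Text inference API. The Hugging Face model ID and the beam size are assumptions inferred from this card, not verified values; the CTC weight matches the 0.3 stated above:

```python
# Hedged usage sketch (pip install espnet espnet_model_zoo soundfile).
# The model ID below is an assumption inferred from this card's name.
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text.from_pretrained(
    "pyf98/librispeech_100_e_branchformer",  # hypothetical model ID
    ctc_weight=0.3,  # joint CTC/attention decoding, as stated above
    beam_size=10,    # assumed decoding setting
)

speech, rate = sf.read("utterance.wav")  # 16 kHz mono audio expected
text, tokens, token_ids, hyp = speech2text(speech)[0]
print(text)
```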
