wav2vec2-large-xlsr-open-brazilian-portuguese-v2

lgris

A fine-tuned wav2vec 2.0 model for Brazilian Portuguese ASR, trained on five Portuguese speech datasets and achieving 10.69% WER on the Common Voice test set.

Property	Value
License	Apache 2.0
Paper	Research Paper
Author	lgris
Test WER	10.69%

What is wav2vec2-large-xlsr-open-brazilian-portuguese-v2?

This is a speech recognition model for Brazilian Portuguese, fine-tuned from the wav2vec 2.0 architecture. It was trained on a combination of five Portuguese speech datasets: CETUC (145 hours), Multilingual LibriSpeech (MLS, 284 hours), VoxForge, Common Voice 6.1, and Lapsbm. Pooling these sources gives the model broader coverage of speakers and recording conditions than any single dataset would provide.

Implementation Details

The model is based on the wav2vec 2.0 architecture; it was fine-tuned with fairseq and then converted for easier deployment. It reaches a word error rate (WER) of 10.69% on the Common Voice test set but 34.53% on out-of-domain TEDx data. It therefore performs well on in-domain speech, while accuracy degrades substantially on speech that differs from its training data.
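Both figures above are word error rates: the minimum number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of reference words. The following is a minimal sketch of that calculation using a word-level edit distance. It only splits on whitespace; whether the reported numbers were computed after extra text normalization (lowercasing, punctuation removal) is not stated here, so treat this as an illustration rather than the exact scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the current reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


# One deleted word out of five reference words gives a WER of 20%.
print(word_error_rate("o gato subiu no telhado", "o gato subiu telhado"))  # 0.2
```

Because insertions are counted, WER is not capped at 100%: a transcript with many spurious words can score above 1.0.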

Trained on five datasets totaling over 400 hours of speech
Expects 16 kHz audio input
Implements CTC-based speech recognition
Targets Brazilian Portuguese
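With CTC, the network emits a score for every vocabulary symbol at every audio frame, plus a special "blank" symbol. The simplest way to turn those scores into text is best-path (greedy) decoding: take the top symbol per frame, merge consecutive repeats, then drop blanks. The sketch below shows that procedure on toy scores. The vocabulary, the `<pad>` blank, and the `|` word delimiter are assumptions for illustration (they mirror common Hugging Face wav2vec 2.0 character vocabularies); the actual vocabulary and decoding setup of this model, including whether a language model or beam search is used, are not described in this card.

```python
from itertools import groupby
from typing import List, Sequence

# Hypothetical character vocabulary for illustration only; index 0 acts as the CTC blank.
VOCAB = ["<pad>", "|", "a", "c", "o", "s"]
BLANK = "<pad>"       # assumed blank symbol
WORD_DELIMITER = "|"  # assumed word-boundary symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], vocab: Sequence[str]) -> str:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # 1. Pick the highest-scoring symbol index for every audio frame.
    best_ids = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. Merge runs of the same index (a blank between two runs keeps them separate).
    merged_ids = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Remove blanks and turn the word delimiter into a space.
    chars = []
    for token_id in merged_ids:
        symbol = vocab[token_id]
        if symbol == BLANK:
            continue
        chars.append(" " if symbol == WORD_DELIMITER else symbol)
    # Collapse repeated spaces and strip leading/trailing ones.
    return " ".join("".join(chars).split())


def one_hot(ids: List[int], size: int) -> List[List[float]]:
    """Build toy frame scores in which the given index wins each frame."""
    return [[1.0 if k == i else 0.0 for k in range(size)] for i in ids]


# Frames spelling "casa osso": the repeated "s" frames in "casa" merge into one,
# while the blank (index 0) between the two "s" runs in "osso" keeps both letters.
frames = one_hot([3, 3, 0, 2, 5, 5, 2, 1, 4, 5, 0, 5, 4], len(VOCAB))
print(ctc_greedy_decode(frames, VOCAB))  # casa osso
```

The blank is what lets CTC spell genuine double letters: without a blank between them, two identical adjacent symbols collapse into one.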

Core Capabilities

Automatic Speech Recognition for Brazilian Portuguese
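Trained on recordings from many speakers, including crowd-sourced data (Common Voice, VoxForge)
Most reliable on clean, in-domain audio; accuracy drops on out-of-domain speech (34.53% WER on TEDx)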
Processes audio at 16kHz sampling rate
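Since the model expects 16 kHz input, audio should be checked (and resampled if necessary) before transcription. The following is a minimal, standard-library-only sketch that reads a WAV file, rejects anything not sampled at 16 kHz, and averages the channels into a single floating-point signal. It assumes 16-bit PCM input and a single-channel waveform as the model input; the function name `load_wav_16k_mono` is hypothetical, and any further normalization the model's own preprocessing applies is not covered here.

```python
import array
import sys
import wave
from typing import BinaryIO, List, Union

TARGET_RATE = 16_000  # sampling rate stated in the model card


def load_wav_16k_mono(source: Union[str, BinaryIO]) -> List[float]:
    """Read a 16-bit PCM WAV file (path or binary file object) as mono floats in [-1, 1)."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        raw = wav.readframes(wav.getnframes())

    if rate != TARGET_RATE:
        raise ValueError(f"expected {TARGET_RATE} Hz audio, got {rate} Hz; resample first")
    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM WAV files")

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV PCM data is stored little-endian

    # Average the interleaved channels of each frame and scale by 2**15.
    return [
        sum(samples[i:i + channels]) / (channels * 32768.0)
        for i in range(0, len(samples), channels)
    ]
```

Raising an error instead of silently resampling is deliberate: feeding audio at the wrong rate typically degrades recognition without any visible failure, so the mismatch should be fixed upstream with a proper resampler.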

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for combining five Brazilian Portuguese datasets in training. The mix of academic read speech (CETUC), crowd-sourced recordings (Common Voice), and audiobooks (MLS) covers several speaking styles and recording contexts, although the 34.53% WER on TEDx shows that this coverage does not fully carry over to out-of-domain speech.

Q: What are the recommended use cases?

The model is best suited to Brazilian Portuguese speech recognition in controlled environments, particularly where clear speech must be transcribed. It performs best on data similar to its training sets, which makes it a good fit for voice commands, transcription services, and automated subtitling of Brazilian Portuguese content when the audio resembles that data. For spontaneous or out-of-domain speech, expect noticeably higher error rates.