wav2vec2-large-robust-ft-libritts-voxpopuli

jbetker

A wav2vec2-large model fine-tuned to transcribe speech with punctuation. It reaches 4.45% WER on the LibriSpeech validation set and is intended for producing transcripts for TTS model training.

Property	Value
Author	jbetker
Downloads	632,156
Architecture	wav2vec2-large
Base Model	facebook/wav2vec2-large-robust-ft-libri-960h

What is wav2vec2-large-robust-ft-libritts-voxpopuli?

This is a speech recognition model built on the wav2vec2-large architecture and designed to produce transcriptions that include punctuation. It is a fine-tuned version of the facebook/wav2vec2-large-robust-ft-libri-960h checkpoint, trained on the LibriTTS and VoxPopuli datasets with a vocabulary that includes punctuation marks.

Implementation Details

The model is built on the wav2vec2-large architecture and achieves a Word Error Rate (WER) of 4.45% on the LibriSpeech validation set, slightly behind the 4.3% of its base model. It uses a custom vocabulary that includes punctuation marks, which makes it useful for preparing transcripts for Text-to-Speech (TTS) model training.

Fine-tuned on clean audio from LibriTTS and VoxPopuli datasets
Custom vocabulary with punctuation support
Compatible with the Transformers library and PyTorch
Optimized for clean audio processing
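
A minimal sketch of running the checkpoint through the Transformers library, under several assumptions. The Hub id is taken to be the author name plus the model name, the feature extractor is borrowed from the base model listed in the table above, and this page does not say where the punctuation vocabulary is published, so `tokenizer_id` is a caller-supplied placeholder. The 16 kHz check reflects the sampling rate that wav2vec2 checkpoints are trained on.

```python
# Assumed Hub id (author/name); not confirmed by this page.
MODEL_ID = "jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli"
# Base model named in the property table; used here only for audio preprocessing.
BASE_ID = "facebook/wav2vec2-large-robust-ft-libri-960h"


def transcribe(waveform, tokenizer_id, sampling_rate=16_000):
    """Return a punctuated transcript for one mono clip.

    waveform: 1-D sequence of float samples (list or NumPy array).
    tokenizer_id: Hub id or local path of the punctuation vocabulary
                  (hypothetical parameter; the location is not given here).
    """
    if sampling_rate != 16_000:
        raise ValueError("resample the audio to 16 kHz before transcribing")

    # Heavy imports are deferred so the module can be imported without them.
    import torch
    from transformers import (
        Wav2Vec2CTCTokenizer,
        Wav2Vec2FeatureExtractor,
        Wav2Vec2ForCTC,
    )

    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(BASE_ID)
    tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(tokenizer_id)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

    inputs = feature_extractor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, frames, vocab_size)

    # Greedy decoding: pick the most likely symbol per frame, then let the
    # tokenizer collapse repeats and remove CTC blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return tokenizer.batch_decode(predicted_ids)[0]
```

Since the model was fine-tuned on clean audio, a sketch like this is most likely to give usable output on studio-quality or audiobook-style recordings.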

Core Capabilities

Speech transcription with punctuation
Best results on clean audio sources
Suited to producing transcripts for TTS model training
4.45% WER on the LibriSpeech validation set
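
For readers who want to check a WER figure on their own data, the metric can be sketched in plain Python as word-level edit distance divided by the number of reference words. The `normalize` helper is an assumption: this page does not say whether punctuation and case were stripped before scoring, but some normalization is needed when punctuated output is compared against unpunctuated reference text.

```python
import string

# Translation table that deletes ASCII punctuation (assumed normalization).
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split into words."""
    return text.lower().translate(_STRIP_PUNCT).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "The cat sat on a mat."))
```

A corpus-level figure such as 4.45% is normally the total number of errors divided by the total number of reference words across all utterances, not the mean of per-utterance rates.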

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its ability to generate transcriptions with accurate punctuation, which is crucial for TTS applications. The custom vocabulary and specialized training on LibriTTS and VoxPopuli datasets make it particularly effective for clean audio transcription tasks.
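
To illustrate why the vocabulary matters: a CTC-based model such as wav2vec2 can only emit symbols that exist in its output vocabulary, so punctuation has to be part of that vocabulary to appear in the transcript. The sketch below shows greedy CTC decoding over a toy vocabulary. The symbols and their indices are hypothetical and are not the model's actual vocabulary.

```python
from itertools import groupby

# Hypothetical toy vocabulary: index 0 is the CTC blank, "|" marks a word boundary.
VOCAB = ["<pad>", "|", "h", "i", ",", "y", "o", "u", "!"]
BLANK_ID = 0


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, drop blanks, and join the symbols."""
    collapsed = [symbol_id for symbol_id, _ in groupby(frame_ids)]
    symbols = [vocab[i] for i in collapsed if i != blank_id]
    return "".join(symbols).replace("|", " ").strip()


# Per-frame argmax ids for: h h _ i , | y y o _ u !
frames = [2, 2, 0, 3, 4, 1, 5, 5, 6, 0, 7, 8]
print(greedy_ctc_decode(frames))  # hi, you!
```

With a letters-only vocabulary, the comma and exclamation mark in this example would have no index to be predicted at, and the output would contain only the words.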

Q: What are the recommended use cases?

The model is best suited for: 1) generating transcriptions for TTS model training, 2) clean audio transcription tasks that require punctuation, and 3) applications where prosody and punctuation accuracy are crucial. Because it was fine-tuned on clean audio, it may not perform as well on noisy audio sources.