whisper-large-v3-french

bofenghuang

Optimized French speech recognition model based on Whisper Large V3, achieving WER of 3.98-8.91% across datasets. 1.61B parameters, MIT licensed.

Property	Value
Parameter Count	1.61B
License	MIT
Paper	Whisper Paper
Model Type	Speech Recognition (ASR)

What is whisper-large-v3-french?

Whisper-Large-V3-French is a speech recognition model fine-tuned from OpenAI's Whisper Large V3 and optimized for French. It reports Word Error Rates (WER) between 3.98% and 8.91%, depending on the benchmark dataset.
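For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the evaluation script behind the reported figures; the function name and the whitespace tokenization are assumptions, and published scores usually also normalize the text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("bonjour tout le monde", "bonjour tous le monde"))
```

Under this definition, a WER of 3.98% corresponds to roughly four word errors per hundred reference words.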

Implementation Details

The model was trained on over 2,500 hours of French speech data drawn from multiple datasets, including Common Voice, Multilingual LibriSpeech, and VoxPopuli. Its transcriptions include predicted casing, punctuation, and numbers.

Supports multiple frameworks, including Hugging Face Transformers, OpenAI Whisper, and Faster Whisper
Supports speculative decoding for 2x faster inference
Runs on both CPU and GPU
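As a rough illustration of the Hugging Face Transformers route, the sketch below wraps the library's `automatic-speech-recognition` pipeline. It assumes `transformers` and `torch` are installed, that `ffmpeg` is available for decoding audio files, and that the model is published under the Hub ID `bofenghuang/whisper-large-v3-french` (inferred from the author and model names above). The helper names and the audio file name are hypothetical.

```python
MODEL_ID = "bofenghuang/whisper-large-v3-french"  # assumed Hub ID (author/model name)


def build_transcriber(device: str = "cpu"):
    """Create an ASR pipeline. The import is deferred so this module loads
    even where transformers is not installed."""
    from transformers import pipeline  # requires: pip install transformers torch

    return pipeline("automatic-speech-recognition", model=MODEL_ID, device=device)


def transcribe(asr, audio_path: str, chunk_length_s: float = 30.0) -> str:
    """Transcribe one audio file. chunk_length_s asks the pipeline to split
    long audio into windows of that many seconds."""
    result = asr(audio_path, chunk_length_s=chunk_length_s)
    return result["text"]


# Usage (downloads the model weights on first call):
# asr = build_transcriber("cuda:0")        # or "cpu"
# print(transcribe(asr, "interview.wav"))  # hypothetical file name
```

Passing the pipeline object into `transcribe` keeps model loading, which is slow, separate from per-file transcription, so one loaded model can serve many files.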

Core Capabilities

High-accuracy French speech transcription, with WER as low as 3.98% on the Multilingual LibriSpeech (MLS) dataset
Robust performance on both short-form and long-form audio
Handles various French accents including African French variants
Supports parallel processing for long audio files
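Parallel processing of long audio generally relies on splitting a recording into overlapping windows that can be transcribed as a batch and then stitched together. The sketch below only computes the window boundaries, to show the idea; it is not the chunking algorithm used by any particular framework, and the function name and the 30-second window and 5-second overlap defaults are assumptions chosen for illustration.

```python
from typing import List, Tuple


def chunk_bounds(
    total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for overlapping windows covering
    a recording of total_s seconds."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")

    total_s = float(total_s)
    step = chunk_s - overlap_s  # how far each new window advances
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:  # last window reached the end of the audio
            break
        start += step
    return bounds


# A 70-second recording yields three windows, each overlapping the next by 5 s.
print(chunk_bounds(70))
```

The overlap gives each window some shared context with its neighbor, which helps when merging transcripts at the boundaries; each window can then be sent to the model independently.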

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its optimization for French: it achieves state-of-the-art performance, with WER between 3.98% and 8.91% across datasets, while also predicting casing, punctuation, and numbers. It has been evaluated on both in-distribution and out-of-distribution datasets, which indicates robustness across different use cases.

Q: What are the recommended use cases?

The model is suited to French speech transcription tasks such as call center conversations, academic lectures, media content, and general speech recognition applications. It handles both short-form (< 30 seconds) and long-form audio.