wav2vec2-xlsr-300m-finnish-lm

Property	Value
Parameter Count	300 million
Model Type	Speech Recognition (ASR)
Architecture	Wav2Vec2 XLS-R
Training Data	275.6 hours of Finnish speech
Best WER (with LM)	8.16%

What is wav2vec2-xlsr-300m-finnish-lm?

This is a fine-tuned version of Facebook's wav2vec2-xls-r-300m model, adapted for Finnish automatic speech recognition (ASR). The base XLS-R model uses the wav2vec 2.0 architecture and was pretrained on 436k hours of unlabeled multilingual speech. The release also includes a Finnish KenLM language model, which is applied during decoding to improve transcription accuracy.
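A minimal sketch of how such a model might be loaded and used through the Hugging Face transformers ASR pipeline. The repository ID below (including the organization prefix) and the function name transcribe are assumptions for illustration and are not taken from this page. Decoding with the bundled KenLM model should additionally require the pyctcdecode and kenlm packages.

```python
# Assumed Hugging Face Hub repository ID; this page does not state the
# organization prefix, so adjust it to the actual location of the model.
MODEL_ID = "Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    # When pyctcdecode and kenlm are installed and the repository ships a
    # language model, the pipeline should decode with it; without them it
    # is expected to fall back to plain CTC decoding.
    asr = pipeline("automatic-speech-recognition", model=model_id)

    # Passing a file path relies on ffmpeg being available for decoding.
    result = asr(audio_path)
    return result["text"]
```

Calling transcribe("sample.wav") with a short Finnish recording (a hypothetical file name) would return the transcript as a string. The first call downloads the model weights, so it needs network access and some disk space.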

Implementation Details

The model was trained on a combination of datasets, with the majority (82.73%) coming from the Aalto Finnish Parliament ASR Corpus. Training ran on a Tesla V100 GPU using an 8-bit Adam optimizer and a linear learning rate scheduler. The model reached its best performance after 10 epochs of training.

Learning rate: 5e-04 with 500 warmup steps
Batch size: 32 for both training and evaluation
Mixed precision training with Native AMP
Includes fine-tuned acoustic model and KenLM language model
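The learning rate settings above can be illustrated with a short sketch of a linear schedule with warmup. It assumes the rate rises linearly from zero to the peak over the 500 warmup steps and then decays linearly to zero, which is how linear schedules commonly behave. The total number of training steps is not reported here, so the value used below is hypothetical.

```python
PEAK_LR = 5e-4      # learning rate from the list above
WARMUP_STEPS = 500  # warmup steps from the list above


def linear_schedule_lr(step: int, total_steps: int,
                       peak_lr: float = PEAK_LR,
                       warmup_steps: int = WARMUP_STEPS) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical step count, chosen only to make the shape of the curve visible.
TOTAL_STEPS = 10_000
for step in (0, 250, 500, 5_250, 10_000):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

The rate peaks exactly at step 500 and is back at half the peak midway through the decay phase (step 5,250 under the assumed total).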

Core Capabilities

Transcription of Finnish speech to text
Optimal performance on audio clips up to 20 seconds
Strong performance on formal Finnish speech
WER of 8.16% with language model, 17.92% without
Supports real-time transcription with appropriate audio chunking
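For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration, not the evaluation script used for this model, and the sentence pair is made up. It also works out the relative WER reduction implied by the two figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,                          # deletion
                          curr[j - 1] + 1,                      # insertion
                          prev[j - 1] + (ref_word != hyp_word)) # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of four reference words.
print(word_error_rate("hyvää huomenta arvoisa puhemies",
                      "hyvää huomenta arvoisat puhemies"))  # 0.25

# Relative WER reduction from adding the language model (figures from the list).
wer_without_lm, wer_with_lm = 17.92, 8.16
print(round(100 * (wer_without_lm - wer_with_lm) / wer_without_lm, 1))  # 54.5
```

By this calculation, the language model removes roughly half of the word errors made by the acoustic model alone.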

Frequently Asked Questions

Q: What makes this model unique?

This model combines a large multilingual pretrained speech model with Finnish-specific training data and a custom language model, which makes it particularly effective for Finnish ASR. It is especially strong on formal Finnish speech, because most of its training data comes from parliamentary proceedings.

Q: What are the recommended use cases?

The model is best suited for transcribing formal Finnish speech, particularly in professional or official contexts. It works best with audio clips of up to 20 seconds and with clear, standard Finnish pronunciation; expect lower accuracy on heavy dialects or informal speech.
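One simple way to respect the 20-second recommendation is to cut longer recordings into fixed-length clips before transcription. The sketch below computes the clip boundaries in samples, assuming 16 kHz audio (the sampling rate wav2vec 2.0 models expect). The function name chunk_bounds is illustrative only.

```python
SAMPLE_RATE = 16_000   # wav2vec 2.0 models expect 16 kHz audio
MAX_CLIP_SECONDS = 20  # clip length recommended above


def chunk_bounds(num_samples: int,
                 sample_rate: int = SAMPLE_RATE,
                 max_seconds: float = MAX_CLIP_SECONDS) -> list[tuple[int, int]]:
    """Return (start, end) sample indices for clips of at most max_seconds."""
    max_len = int(sample_rate * max_seconds)
    return [(start, min(start + max_len, num_samples))
            for start in range(0, num_samples, max_len)]


# A 45-second recording is split into clips of 20 s, 20 s and 5 s.
print(chunk_bounds(45 * SAMPLE_RATE))
# [(0, 320000), (320000, 640000), (640000, 720000)]
```

Fixed-length cuts can split a word in two, so in practice it may be better to cut at pauses or to use overlapping chunks and merge the transcripts.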