wav2vec2-xlsr-1b-finnish-lm

Finnish-NLP

A 1-billion-parameter Finnish automatic speech recognition (ASR) model fine-tuned on 259.57 hours of Finnish speech data. With language model decoding it achieves a 5.65% word error rate (WER).

Property	Value
Parameter Count	1 billion
Training Data	259.57 hours of Finnish speech
Base Model	facebook/wav2vec2-xls-r-1b
Best WER	5.65% (with LM)

What is wav2vec2-xlsr-1b-finnish-lm?

This is a Finnish Automatic Speech Recognition (ASR) model fine-tuned from Facebook's wav2vec2-xls-r-1b checkpoint. It was fine-tuned on 259.57 hours of Finnish speech data, primarily from the Finnish Parliament corpus, and includes a Finnish KenLM language model for improved transcription accuracy.

Implementation Details

The model builds on the wav2vec 2.0 architecture; its base checkpoint was pre-trained on 436k hours of multilingual speech data. Fine-tuning used an 8-bit Adam optimizer with a learning rate of 5e-05 and a linear learning-rate scheduler with 500 warmup steps. Training ran for 5 epochs with mixed precision on a Tesla V100 GPU.
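The learning-rate schedule described above can be sketched in a few lines of plain Python. This is an illustration only, not the training script: the peak rate (5e-05) and the 500 warmup steps come from the text, while `total_steps` is a hypothetical value (the source does not state the total step count) and the linear decay to zero after warmup is an assumption based on how linear schedules with warmup are commonly implemented.

```python
def linear_schedule_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=10_000):
    """Learning rate at a given optimizer step.

    peak_lr and warmup_steps follow the text; total_steps is a
    hypothetical placeholder, and the decay-to-zero shape is assumed.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from the peak to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 5_000, 10_000):
    print(step, linear_schedule_lr(step))
```

The rate peaks exactly at step 500 and falls from there, so most of the fine-tuning runs below the nominal 5e-05.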

Trained on several Finnish speech datasets, with 87.84% of the data from Parliament recordings
Includes a 5-gram KenLM language model for improved accuracy
Supports audio samples of up to 20 seconds each
Achieves 5.65% WER on Common Voice 7.0 test set with language model
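For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal, dependency-free version for illustration; it is not the evaluation script used for the reported 5.65%, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of four reference words.
print(word_error_rate("hyvää huomenta arvoisa puhemies",
                      "hyvää huomenta arvoisa puhemiehet"))  # 0.25
```

A WER of 5.65% therefore corresponds to roughly one word error in every 18 reference words.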

Core Capabilities

Finnish speech-to-text transcription
Optimized for formal Finnish speech recognition
Supports decoding both with and without the language model
Handles various Finnish speech contexts, with a focus on parliamentary speech
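Decoding without the language model usually means greedy CTC decoding, the standard approach for CTC-trained wav2vec 2.0 models: take the most likely token per audio frame, collapse consecutive repeats, and drop the blank token. The sketch below shows that step on its own, assuming the per-frame argmax has already been taken. The tiny vocabulary and frame sequence are hypothetical and do not reflect this model's actual tokenizer.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    # Step 1: merge runs of identical consecutive token ids.
    collapsed = [token for token, _ in groupby(frame_ids)]
    # Step 2: drop the blank token and map the remaining ids to characters.
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Hypothetical vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<blank>", 1: "h", 2: "e", 3: "i"}
frames = [1, 1, 0, 2, 2, 2, 0, 3, 3]
print(ctc_greedy_decode(frames, vocab))  # hei
```

Language model decoding replaces the per-frame argmax with a beam search whose candidate transcripts are additionally scored by the n-gram model, which is where the accuracy gain comes from.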

Frequently Asked Questions

Q: What makes this model unique?

This model combines a 1-billion-parameter multilingual transformer with Finnish-specific fine-tuning. Its integration with a Finnish 5-gram KenLM language model improves transcription accuracy; with the language model it reaches 5.65% WER on the Common Voice 7.0 test set.

Q: What are the recommended use cases?

The model is best suited for transcribing formal Finnish speech, particularly in professional or official contexts. It performs best on audio clips under 20 seconds; longer recordings may need to be split into chunks. It is most effective on clear, standard Finnish speech rather than heavily dialectal or informal conversation.
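One simple way to chunk a longer recording is to cut it into overlapping windows of at most 20 seconds and transcribe each window separately. The sketch below only computes the sample ranges. The 16 kHz sample rate is an assumption (it is the rate wav2vec 2.0 models normally expect), and the 2-second overlap is a hypothetical choice intended to avoid cutting words at window boundaries.

```python
def chunk_spans(num_samples, sample_rate=16_000, chunk_s=20.0, overlap_s=2.0):
    """Return (start, end) sample index pairs covering the whole recording.

    sample_rate and overlap_s are assumptions, not values from the model card.
    """
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    spans = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return spans


# A 50-second recording at 16 kHz is covered by three windows.
print(chunk_spans(50 * 16_000))
# [(0, 320000), (288000, 608000), (576000, 800000)]
```

Merging the per-window transcripts across the overlap is left out here. The Hugging Face Transformers speech recognition pipeline offers `chunk_length_s` and `stride_length_s` options that handle both the windowing and the merging.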