wav2vec2-xlsr-1b-finnish-lm

Maintained By
Finnish-NLP

Parameter Count: 1 billion
Training Data: 259.57 hours of Finnish speech
Base Model: facebook/wav2vec2-xls-r-1b
Best WER: 5.65% (with LM)

What is wav2vec2-xlsr-1b-finnish-lm?

This is a state-of-the-art Finnish automatic speech recognition (ASR) model fine-tuned from Facebook's wav2vec2-xls-r-1b base model. It was fine-tuned on 259.57 hours of Finnish speech data, primarily from the Finnish Parliament corpus, and includes a Finnish KenLM language model that improves transcription accuracy.

Implementation Details

The base model uses the wav2vec 2.0 architecture and was pre-trained on 436k hours of multilingual speech data. Fine-tuning used the 8-bit Adam optimizer with a learning rate of 5e-05 and a linear learning-rate scheduler with 500 warmup steps. Training ran for 5 epochs with mixed precision on a Tesla V100 GPU.
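
To make the learning-rate schedule concrete, here is a minimal pure-Python sketch of linear warmup followed by linear decay. The peak rate (5e-05) and the 500 warmup steps come from the description above. The total step count is not stated here, so `TOTAL_STEPS` is a purely illustrative assumption, and decaying to zero is the usual behavior of a linear scheduler rather than a detail confirmed for this model.

```python
PEAK_LR = 5e-05        # from the model description
WARMUP_STEPS = 500     # from the model description
TOTAL_STEPS = 10_000   # ASSUMPTION: illustrative only, not stated in the source


def linear_lr(step: int,
              peak_lr: float = PEAK_LR,
              warmup_steps: int = WARMUP_STEPS,
              total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to the peak rate.
        return peak_lr * step / warmup_steps
    # Decay linearly from the peak rate down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for s in (0, 250, 500, 5_250, 10_000):
        print(s, linear_lr(s))
```

The rate climbs to its peak at step 500 and then falls steadily, which keeps early updates small while the freshly initialized output layer settles.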

  • Trained on diverse Finnish speech datasets, with 87.84% from Parliament recordings
  • Implements a 5-gram KenLM language model for enhanced accuracy
  • Maximum supported audio length of 20 seconds per sample
  • Achieves 5.65% WER on Common Voice 7.0 test set with language model
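
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a generic, minimal implementation for illustration. It is not the evaluation script used for this model, and the Finnish example sentences are made up.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one substituted word out of three.
    print(wer("hyvää huomenta kaikille", "hyvää huomenta kaikki"))
```

A WER of 5.65% therefore corresponds to roughly 5 to 6 word errors (substitutions, insertions, or deletions) per 100 reference words.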

Core Capabilities

  • Finnish speech-to-text transcription
  • Optimized for formal Finnish speech recognition
  • Supports decoding both with and without the language model
  • Handles various Finnish speech contexts with focus on parliamentary speech
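
As a rough illustration of the transcription capability, the sketch below loads the model through the Hugging Face `transformers` speech-recognition pipeline. Several things are assumptions: the Hub ID `Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm` is inferred from the maintainer and model name above, `transcribe` is a hypothetical helper name, and language-model decoding is expected to need the optional `pyctcdecode` and `kenlm` packages in addition to `transformers`.

```python
# ASSUMPTION: Hub ID inferred from the maintainer and model name on this page.
MODEL_ID = "Finnish-NLP/wav2vec2-xlsr-1b-finnish-lm"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file and return the recognized text.

    Hypothetical helper: requires `transformers` (plus an audio backend
    such as ffmpeg for file decoding) and downloads the model on first use.
    """
    # Imported lazily so this module can be loaded without transformers.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(audio_path)  # a dict containing a "text" field
    return result["text"]


if __name__ == "__main__":
    # "speech.wav" is a placeholder path, not a file shipped with the model.
    print(transcribe("speech.wav"))
```

The model has about one billion parameters, so loading it takes a noticeable amount of memory and a GPU is advisable for anything beyond short tests.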

Frequently Asked Questions

Q: What makes this model unique?

This model combines a large-scale multilingual transformer architecture with specialized Finnish language training, making it one of the most powerful Finnish ASR models available. Its integration with a custom KenLM language model significantly improves transcription accuracy.

Q: What are the recommended use cases?

The model is best suited for transcribing formal Finnish speech, particularly in professional or official contexts. It performs best on audio clips under 20 seconds, so longer recordings may need to be split into chunks first. It is particularly effective for clear, standard Finnish speech rather than heavily dialectal or informal conversation.
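
One simple way to chunk a longer recording is to compute fixed-length windows of at most 20 seconds and transcribe each window separately. The sketch below only computes the window boundaries in seconds. The 2-second overlap between neighboring windows is an assumption added here to reduce the chance of cutting a word in half, and `chunk_bounds` is a hypothetical helper name.

```python
def chunk_bounds(total_s: float,
                 max_s: float = 20.0,
                 overlap_s: float = 2.0) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds covering a recording of total_s.

    Each chunk is at most max_s long; consecutive chunks share overlap_s
    seconds (ASSUMPTION: the overlap value is illustrative, not from the source).
    """
    if not 0 <= overlap_s < max_s:
        raise ValueError("overlap_s must be non-negative and smaller than max_s")
    if total_s <= 0:
        return []

    step = max_s - overlap_s
    bounds = []
    start = 0.0
    while True:
        end = min(start + max_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start += step
    return bounds


if __name__ == "__main__":
    print(chunk_bounds(50.0))
```

With the defaults, a 50-second recording yields the windows (0, 20), (18, 38), and (36, 50). The overlapping regions produce duplicated words, so the per-chunk transcripts need to be merged with some care.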
