wav2vec2-xlsr-300m-finnish-lm

Maintained by: Finnish-NLP

Property             Value
Parameter Count      300 million
Model Type           Speech Recognition (ASR)
Architecture         Wav2Vec2 XLS-R
Training Data        275.6 hours of Finnish speech
Best WER (with LM)   8.16%

What is wav2vec2-xlsr-300m-finnish-lm?

This model is a fine-tuned version of Facebook's wav2vec2-xls-r-300m, adapted for Finnish Automatic Speech Recognition (ASR). The base model uses the wav2vec 2.0 architecture and was pretrained on 436k hours of multilingual speech data. A Finnish KenLM language model is included to improve transcription accuracy during the decoding phase.
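As a rough illustration of how a model like this is typically loaded, the sketch below uses the Hugging Face `transformers` automatic-speech-recognition pipeline. The Hub id `Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm` is an assumption pieced together from the maintainer and model name above, and the helper names `build_transcriber` and `transcribe` are hypothetical. Decoding with the bundled KenLM model generally also needs the `pyctcdecode` and `kenlm` packages installed.

```python
# Assumed Hub id (maintainer + model name from this page); verify before use.
MODEL_ID = "Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm"


def build_transcriber(model_id: str = MODEL_ID):
    """Create an ASR pipeline for the given model id (hypothetical helper)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=model_id)


def transcribe(path: str, transcriber=None) -> str:
    """Return the transcript for one audio file.

    `transcriber` is any callable that maps a file path to a dict with a
    "text" key, which is the shape the ASR pipeline returns.
    """
    if transcriber is None:
        transcriber = build_transcriber()
    return transcriber(path)["text"]
```

Calling `transcribe("clip.wav")` would download the model on first use. Passing a `transcriber` explicitly lets one pipeline instance be reused across many files.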

Implementation Details

The model was trained on a combination of datasets, with the majority (82.73%) coming from the Aalto Finnish Parliament ASR Corpus. Training ran on a Tesla V100 GPU with the 8-bit Adam optimizer and a linear learning rate scheduler. The model achieved its best performance after 10 epochs of training.

  • Learning rate: 5e-04 with 500 warmup steps
  • Batch size: 32 for both training and evaluation
  • Mixed precision training with Native AMP
  • Includes fine-tuned acoustic model and KenLM language model
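To make the learning rate settings above concrete, here is a minimal sketch of a warmup-then-linear-decay schedule with a 5e-04 peak and 500 warmup steps. The page only says "linear learning rate scheduler", so the exact shape is an assumption (linear ramp from zero, then linear decay to zero). The function name `linear_lr` and the `total_steps` value in the usage note are hypothetical.

```python
PEAK_LR = 5e-4       # learning rate listed above
WARMUP_STEPS = 500   # warmup steps listed above


def linear_lr(step: int, total_steps: int,
              peak_lr: float = PEAK_LR,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak during warmup.
        return peak_lr * step / warmup_steps
    # Decay linearly from the peak down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

With a hypothetical `total_steps=10_000`, the rate is half the peak at step 250, reaches 5e-04 at step 500, and falls back to zero at step 10,000.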

Core Capabilities

  • Transcription of Finnish speech to text
  • Optimal performance on audio clips up to 20 seconds
  • Strong performance on formal Finnish speech
  • WER of 8.16% with language model, 17.92% without
  • Supports real-time transcription with appropriate audio chunking
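For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration and not the evaluation script used for the figures above. The function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and hyp[:j]; start with the empty reference prefix.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline


# One dropped word out of four reference words gives a WER of 0.25.
print(word_error_rate("eduskunta hyväksyi lain tänään",
                      "eduskunta hyväksyi lain"))
# Going from 17.92% to 8.16% is a relative reduction of about 54%.
print(round(relative_reduction(17.92, 8.16), 3))
```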

Frequently Asked Questions

Q: What makes this model unique?

This model combines a large-scale multilingual pretrained speech model with Finnish-specific fine-tuning data and a Finnish KenLM language model. Because most of its training data comes from parliamentary proceedings, it is especially strong at recognizing formal Finnish speech.

Q: What are the recommended use cases?

The model is best suited to transcribing formal Finnish speech, particularly in professional or official contexts. It works best on audio clips of up to 20 seconds and on clear, standard Finnish pronunciation rather than heavy dialects or informal speech.
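One way to stay within the 20-second guideline on longer recordings is to split the audio into overlapping windows before transcription. The sketch below only computes sample boundaries. It assumes 16 kHz audio, the rate wav2vec 2.0 models expect, and a hypothetical 2-second overlap. The function name `chunk_bounds` is illustrative, and merging the per-chunk transcripts is left out.

```python
SAMPLE_RATE = 16_000  # wav2vec 2.0 / XLS-R models expect 16 kHz input


def chunk_bounds(num_samples: int,
                 chunk_s: float = 20.0,
                 overlap_s: float = 2.0,   # hypothetical overlap, tune as needed
                 sample_rate: int = SAMPLE_RATE) -> list[tuple[int, int]]:
    """Return (start, end) sample indices for overlapping chunks."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


# A 50-second clip at 16 kHz is covered by three chunks of at most 20 s each.
print(chunk_bounds(50 * SAMPLE_RATE))
```

The `transformers` ASR pipeline also accepts `chunk_length_s` and `stride_length_s` arguments, which may be a simpler route if the pipeline is already in use.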
