opus-mt-en-iir
| Property | Value |
|---|---|
| Model Type | Transformer |
| Task | Machine Translation |
| Source Language | English |
| Target Languages | 30+ Indo-Iranian languages |
| BLEU Score | 13.7 (Tatoeba test) |
| Training Date | August 1, 2020 |
| Model URL | Hugging Face |
What is opus-mt-en-iir?
opus-mt-en-iir is a multilingual machine translation model developed by Helsinki-NLP that translates English text into Indo-Iranian languages. A single model covers more than 30 target languages, including Hindi, Bengali, Persian, and Gujarati. It uses a transformer architecture with SentencePiece tokenization and a 32k vocabulary.
Implementation Details
Input text is normalized and then tokenized with SentencePiece. Because one model serves many target languages, every input sentence must begin with a language token of the form >>id<<, where id is the target language identifier. The model was trained on OPUS data, and translation quality varies by language pair: the highest reported scores are for Marathi (BLEU 20.7), Hindi (BLEU 17.0), and Bengali (BLEU 15.3).
- Preprocessing: Normalization + SentencePiece (spm32k,spm32k)
- Architecture: Transformer-based neural machine translation
- Performance metrics: overall BLEU of 13.7 and chrF of 0.392 on the Tatoeba test set
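The language-token requirement described above can be sketched as follows. This is a minimal illustration, not the official usage: it assumes the checkpoint is published on the Hugging Face Hub as Helsinki-NLP/opus-mt-en-iir and that the transformers library (with PyTorch) is installed. The helper names add_target_token and translate are hypothetical and exist only for this example.

```python
from typing import List

# Assumed Hugging Face Hub identifier for this model.
MODEL_NAME = "Helsinki-NLP/opus-mt-en-iir"


def add_target_token(text: str, lang_id: str) -> str:
    """Prefix a sentence with the >>id<< token that selects the target language."""
    lang_id = lang_id.strip()
    if not lang_id:
        raise ValueError("lang_id must be a non-empty language identifier")
    return f">>{lang_id}<< {text.strip()}"


def translate(sentences: List[str], lang_id: str) -> List[str]:
    """Translate English sentences into the language selected by lang_id."""
    # Imported lazily so add_target_token works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
    model = MarianMTModel.from_pretrained(MODEL_NAME)

    # Every source sentence carries the target-language token at its start.
    batch = tokenizer(
        [add_target_token(s, lang_id) for s in sentences],
        return_tensors="pt",
        padding=True,
    )
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling translate(["How are you today?"], "hin") would request Hindi output, assuming "hin" is one of the identifiers the model accepts; the valid identifiers should be taken from the model card rather than from this sketch.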
Core Capabilities
- Supports translation to major Indo-Iranian languages including Hindi, Bengali, Persian, and Gujarati
- Handles multiple script systems including Devanagari, Arabic, Cyrillic, and Latin
- Covers news and general-domain content, with quality that varies by language pair
- Offers flexibility through language-specific tokens for target language selection
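Since the targets span several writing systems, a quick check that a translation came back in the expected script can catch a wrong or missing language token. The sketch below is one possible sanity check using only the Python standard library; it is not part of the model or its tooling. The function name dominant_script is hypothetical, and the script list is limited to the four scripts named above.

```python
import unicodedata
from collections import Counter

# Scripts named in the capabilities list; extend as needed.
SCRIPTS = ("DEVANAGARI", "ARABIC", "CYRILLIC", "LATIN")


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters and marks in text."""
    counts = Counter()
    for ch in text:
        # Keep letters and combining marks; skip digits, spaces, punctuation.
        if not ch.isalpha() and unicodedata.category(ch)[0] != "M":
            continue
        # Unicode character names begin with the script name.
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"
```

For example, output requested in Hindi would be expected to report DEVANAGARI; a LATIN result would suggest the language token was dropped or mistyped.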
Frequently Asked Questions
Q: What makes this model unique?
The model's main strength is breadth: a single model covers more than 30 Indo-Iranian target languages, selected at inference time with a language token. This makes it a practical option for low-resource languages in the family, where a dedicated translation model may not be available.
Q: What are the recommended use cases?
The model is suited to general-purpose translation from English into Indo-Iranian languages, for example content localization, document translation, and cross-lingual information access in South Asian contexts. Its highest reported scores are for Marathi, Hindi, and Bengali, so those language pairs are the safest starting point; output for lower-scoring pairs should be reviewed before use.