opus-mt-en-trk
Property | Value |
---|---|
License | Apache 2.0 |
Developer | Helsinki-NLP |
Architecture | Transformer |
Training Date | 2020-08-01 |
What is opus-mt-en-trk?
opus-mt-en-trk is a specialized machine translation model designed to translate from English to various Turkic languages. Developed by Helsinki-NLP, this transformer-based model supports translation into 24+ language variants including Turkish, Azerbaijani, Kazakh, and Uzbek in different scripts (Latin, Cyrillic, and Arabic).
Implementation Details
The model uses a transformer architecture with SentencePiece tokenization (spm32k,spm32k). Each input sentence must begin with a language token of the form >>id<<, where id is the ID of the target language, so the model knows which language to produce. The model was trained on the OPUS corpus, and performance varies across Turkic languages: Turkish (BLEU: 34.6), Kyrgyz (BLEU: 28.6), and Azerbaijani (BLEU: 26.8) show the strongest results.
- Preprocessing includes normalization and SentencePiece tokenization
- Supports multiple script variants (Latin, Cyrillic, Arabic) for several languages
- Trained on OPUS corpus with 2M sentence pairs
- Implements language-specific tokens for target language selection
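The target-language token described above can be illustrated with a minimal sketch. It assumes the model is published on the Hugging Face Hub as Helsinki-NLP/opus-mt-en-trk, that the transformers, sentencepiece, and torch packages are installed, and that tur is the ID for Turkish; none of these details are stated in this document. The helper names add_target_token and translate are hypothetical.

```python
from typing import List


def add_target_token(sentences: List[str], target_id: str) -> List[str]:
    """Prefix each sentence with the >>id<< token that selects the target language."""
    token = f">>{target_id}<<"
    return [f"{token} {sentence.strip()}" for sentence in sentences]


def translate(
    sentences: List[str],
    target_id: str,
    model_name: str = "Helsinki-NLP/opus-mt-en-trk",  # assumed Hub ID
) -> List[str]:
    """Translate English sentences into the language selected by target_id."""
    # Imported here so add_target_token works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # The tokenizer applies the SentencePiece model bundled with the checkpoint.
    batch = tokenizer(
        add_target_token(sentences, target_id),
        return_tensors="pt",
        padding=True,
    )
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


if __name__ == "__main__":
    # "tur" is an assumed target ID for Turkish.
    print(add_target_token(["How are you?"], "tur"))
```

Calling translate(["How are you?"], "tur") downloads the checkpoint on first use, so the prefixing step is kept in its own function that can be checked without network access.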
Core Capabilities
- Multi-target translation supporting 24+ Turkic language variants
- Handles both modern and historical Turkic languages (including Ottoman Turkish)
- Best performance for Turkish (BLEU: 34.6) and Kyrgyz (BLEU: 28.6)
- Supports different writing systems for the same language
Frequently Asked Questions
Q: What makes this model unique?
This model handles multiple Turkic languages and their script variants in a single model, making it a versatile tool for translating English into a broad range of the Turkic language family. Because the language token is part of each input sentence, the target language can be chosen per sentence at inference time.
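As a rough sketch of per-sentence target selection, the snippet below prepares one English sentence for several targets at once. The IDs tur, uzb_Latn, and uzb_Cyrl are assumptions about how the model names its targets and script variants, and fan_out is a hypothetical helper, not part of any library.

```python
from typing import Dict, List


def fan_out(sentence: str, target_ids: List[str]) -> Dict[str, str]:
    """Map each target ID to the sentence prefixed with its >>id<< token."""
    return {tid: f">>{tid}<< {sentence}" for tid in target_ids}


# Assumed IDs: Turkish, plus Uzbek in Latin and Cyrillic script.
inputs = fan_out("Good morning.", ["tur", "uzb_Latn", "uzb_Cyrl"])
for target_id, text in inputs.items():
    print(target_id, "->", text)
```

Under these assumptions, the prefixed strings could be tokenized together as a single batch, since each one carries its own target token.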
Q: What are the recommended use cases?
The model is best suited for translating into major Turkic languages like Turkish, Azerbaijani, and Kyrgyz where it shows the highest BLEU scores. It's particularly useful for applications requiring translation into multiple Turkic languages, though performance varies significantly between languages.
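One way to act on this guidance is to gate target languages on their reported scores. The sketch below uses only the three BLEU figures quoted in this document; the 25.0 threshold and the is_recommended helper are illustrative assumptions, not recommendations from the model's developers.

```python
from typing import Optional

# BLEU scores quoted in this document; other languages are not listed here.
REPORTED_BLEU = {
    "Turkish": 34.6,
    "Kyrgyz": 28.6,
    "Azerbaijani": 26.8,
}


def is_recommended(language: str, min_bleu: float = 25.0) -> bool:
    """Return True if the language has a reported BLEU at or above min_bleu.

    Languages without a quoted score are treated as not recommended,
    because nothing here supports a claim about their quality.
    """
    score: Optional[float] = REPORTED_BLEU.get(language)
    return score is not None and score >= min_bleu


if __name__ == "__main__":
    for name in ("Turkish", "Kyrgyz", "Azerbaijani", "Uzbek"):
        print(name, is_recommended(name))
```

A stricter or looser threshold may suit a given application; BLEU on a single test set is only a coarse indicator of translation quality.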