opus-mt-tc-base-en-sh
| Property | Value |
|---|---|
| License | CC-BY-4.0 |
| Release Date | 2021-04-20 |
| Developer | Helsinki-NLP |
| Architecture | transformer-align |
What is opus-mt-tc-base-en-sh?
opus-mt-tc-base-en-sh is a neural machine translation model developed by the Language Technology Research Group at the University of Helsinki. It translates from English into Serbo-Croatian language variants, including Bosnian, Croatian, and Serbian (in both Cyrillic and Latin scripts). The model is part of the OPUS-MT project, which aims to make neural machine translation accessible for many of the world's languages.
Implementation Details
The model was trained with the Marian NMT framework and later converted to PyTorch using the Hugging Face transformers library. It uses a transformer-align architecture. Because one model serves several target variants, each source sentence must begin with a language token (e.g., >>hrv<<, >>bos_Latn<<) that indicates the target language variant.
- Trained on OPUS data with additional back-translated data
- Uses SentencePiece tokenization with a 32k vocabulary
- Supports multiple target variants through language tokens
- Reported BLEU scores range from 28.7 to 49.7 depending on the language variant
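The language-token requirement can be sketched as follows. This is a minimal example, not the project's reference usage: it assumes the model is published on the Hugging Face Hub under the ID `Helsinki-NLP/opus-mt-tc-base-en-sh` (inferred from the developer and model names above), and that `transformers`, `sentencepiece`, and `torch` are installed. The helper names `with_target_token` and `translate` are hypothetical.

```python
from typing import List

# Assumed Hub ID (developer name + model name); not stated in the text above.
MODEL_ID = "Helsinki-NLP/opus-mt-tc-base-en-sh"


def with_target_token(text: str, lang: str) -> str:
    """Prefix a source sentence with the >>lang<< target-language token."""
    return f">>{lang}<< {text.strip()}"


def translate(sentences: List[str], lang: str, model_id: str = MODEL_ID) -> List[str]:
    """Translate English sentences into one target variant, e.g. "hrv"."""
    # Imported lazily so with_target_token stays usable without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_id)
    model = MarianMTModel.from_pretrained(model_id)

    # Every source sentence carries its own target-language token.
    batch = tokenizer(
        [with_target_token(s, lang) for s in sentences],
        return_tensors="pt",
        padding=True,
    )
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["The weather is nice today."], "hrv")` would download the weights on first use and return a list with one Croatian sentence. Passing `"bos_Latn"` instead selects Bosnian in Latin script.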
Core Capabilities
- English to Croatian translation (BLEU: 49.7)
- English to Serbian-Cyrillic translation (BLEU: 45.1)
- English to Bosnian-Latin translation (BLEU: 46.3)
- Multiple target variants in one model, selected with a language token
- Production-ready performance on standard translation tasks
Frequently Asked Questions
Q: What makes this model unique?
A single model covers multiple Serbo-Croatian variants, and a language token at the start of each input selects the target variant. This makes it a good fit for a region where closely related standard languages and two scripts coexist.
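Because the token is set per sentence, one batch can mix target variants. The sketch below illustrates that idea without depending on any particular library: `localize` and its `translate_batch` parameter are hypothetical names, and `translate_batch` stands in for whatever function actually runs the model on a list of token-prefixed strings.

```python
from typing import Callable, Dict, List, Sequence


def localize(
    text: str,
    langs: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
) -> Dict[str, str]:
    """Translate one English string into several variants in a single batch."""
    # One input per target variant, each prefixed with its own >>lang<< token.
    batch = [f">>{lang}<< {text}" for lang in langs]
    outputs = translate_batch(batch)
    if len(outputs) != len(batch):
        raise ValueError("translate_batch must return one output per input")
    # Map each variant code back to its translation.
    return dict(zip(langs, outputs))
```

With a real backend, `localize("Welcome!", ["hrv", "bos_Latn"], translate_batch)` would return a dictionary keyed by variant code, which is convenient for localization files that store one entry per target variant.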
Q: What are the recommended use cases?
This model suits translation systems that need English-to-Serbo-Croatian support, particularly when they must serve several regional variants. Typical uses include content localization, document translation, and multilingual applications targeting the Western Balkans.