OpenLID-v2
Property | Value |
---|---|
Authors | Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, Kenneth Heafield |
License | GPL-3.0 |
Model Type | Language Identification (Text Classification) |
Language Coverage | 189 languages |
Architecture | FastText |
What is OpenLID-v2?
OpenLID-v2 is a language identification model that improves on its predecessor. It reports a macro-average F1 score of 0.93 and a false positive rate of 0.033% across 201 languages; note that this count differs from the 189 languages given as the model's coverage elsewhere in this summary. The model is built on the FastText architecture and is designed to classify text from a wide range of languages accurately and reliably.
Implementation Details
The model is trained with a softmax loss function, 256-dimensional embeddings, and character n-grams of length 2 to 5. Training ran for 2 epochs with a learning rate of 0.8 and a hash bucket size of 1,000,000. The main settings are:
- Embedding dimension: 256
- Character n-grams: 2-5
- Word n-grams: 1
- Minimum word occurrences: 1000
- Temperature-based sampling for class balancing
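The settings above can be sketched as keyword arguments to the `fasttext` Python package's `train_supervised` function. This is a minimal sketch, not the authors' training script: the `train` helper and its `train_path` argument are hypothetical, and the source does not state the sampling temperature or the exact sampling formula, so `temperature_probs` shows one common formulation (raising each class's share of the data to the power 1/T and renormalising) with made-up counts.

```python
# Hyperparameters as listed in this section, using the keyword names
# accepted by fasttext.train_supervised.
TRAIN_PARAMS = {
    "loss": "softmax",
    "dim": 256,          # embedding dimension
    "minn": 2,           # shortest character n-gram
    "maxn": 5,           # longest character n-gram
    "wordNgrams": 1,     # word n-grams
    "minCount": 1000,    # minimum word occurrences
    "epoch": 2,
    "lr": 0.8,
    "bucket": 1_000_000,
}


def train(train_path):
    """Hypothetical helper: train a classifier on a FastText-format file."""
    import fasttext  # imported lazily; requires `pip install fasttext`

    return fasttext.train_supervised(input=train_path, **TRAIN_PARAMS)


def temperature_probs(counts, temperature):
    """Sampling probability per class under temperature-based sampling.

    Assumed formulation: q_i is proportional to (n_i / N) ** (1 / T).
    T = 1 keeps the raw class shares; larger T flattens the distribution,
    so low-resource classes are sampled more often.
    """
    total = sum(counts.values())
    scaled = {c: (n / total) ** (1.0 / temperature) for c, n in counts.items()}
    norm = sum(scaled.values())
    return {c: s / norm for c, s in scaled.items()}


# Illustrative counts only; not taken from the OpenLID-v2 training data.
example = temperature_probs({"eng": 9000, "gla": 1000}, temperature=2.0)
```

With a temperature above 1, the smaller class in `example` receives a larger sampling probability than its raw 10% share, which is the class-balancing effect the last list item refers to.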
Core Capabilities
- High-accuracy language identification across 189 languages
- Text-cleaning preprocessing before classification
- Multiple language prediction capability (k-best outputs)
- Robust performance on various text domains
- A confidence score for each predicted language
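The preprocessing, k-best, and confidence-score capabilities listed above might be combined as in the following sketch. It assumes a model object exposing the `fasttext` package's `predict(text, k=...)` method, which returns labels prefixed with `__label__` alongside their probabilities. The `clean_text` function is a simplified stand-in, not the model's own preprocessing routine, and the file name in the usage comment is a placeholder.

```python
import re

LABEL_PREFIX = "__label__"  # default label prefix used by the fasttext package


def clean_text(text):
    """Collapse whitespace runs, including newlines, into single spaces.

    The fasttext predict call handles one line at a time, so newlines
    have to be removed before classification.
    """
    return re.sub(r"\s+", " ", text).strip()


def format_predictions(labels, probs):
    """Pair each label (prefix stripped) with its probability as a float."""
    results = []
    for label, prob in zip(labels, probs):
        if label.startswith(LABEL_PREFIX):
            label = label[len(LABEL_PREFIX):]
        results.append((label, float(prob)))
    return results


def identify(model, text, k=3):
    """Return up to k (language, confidence) pairs in the model's order."""
    labels, probs = model.predict(clean_text(text), k=k)
    return format_predictions(labels, probs)


# Usage, assuming the model file has been downloaded locally:
#   import fasttext
#   model = fasttext.load_model("model.bin")  # placeholder path
#   identify(model, "Bonjour tout le monde", k=3)
```

Keeping the cleaning and label formatting in separate functions makes them easy to test without loading the model.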
Frequently Asked Questions
Q: What makes this model unique?
OpenLID-v2 combines broad coverage (189 languages) with high accuracy. Its improved architecture and carefully curated training data make it reliable for real-world applications.
Q: What are the recommended use cases?
The model is suited to language detection in multilingual content processing and content filtering, and to use as a preprocessing step in NLP pipelines. It is particularly useful when the input spans many languages and accuracy is crucial.
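As an illustration of the content-filtering use case, the sketch below keeps only documents identified as a target language with sufficient confidence. The `filter_by_language` function, the `predict` callable, and the 0.5 default threshold are all assumptions made for this example; in practice `predict` would wrap the model's top-1 prediction.

```python
def filter_by_language(docs, predict, target, min_confidence=0.5):
    """Yield documents whose top predicted language matches the target.

    docs:           iterable of text strings
    predict:        callable mapping text to a (label, confidence) pair
    target:         language label to keep
    min_confidence: hypothetical cut-off; tune it on held-out data
    """
    for doc in docs:
        label, confidence = predict(doc)
        if label == target and confidence >= min_confidence:
            yield doc
```

Passing the predictor in as a callable keeps the filter independent of any particular model, so it can be tested with a stub and reused if the underlying classifier changes.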