OpenLID-v2

Maintained by: laurievb

Authors: Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, Kenneth Heafield
License: GPL-3.0
Model type: Language identification (text classification)
Language coverage: 189 languages
Architecture: fastText

What is OpenLID-v2?

OpenLID-v2 is a language identification model that improves on its predecessor. The reported evaluation figures are a macro-average F1 score of 0.93 and a false positive rate of 0.033%, measured across 201 languages; the model's stated coverage is 189 languages. It is built on fastText and is designed to identify a wide range of languages accurately and reliably.
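To make the two reported metrics concrete, here is a minimal sketch of how a macro-average F1 score and a false positive rate can be computed from gold and predicted labels. It assumes the false positive rate is averaged per language in the same way as F1; the evaluation behind the figures above may differ in detail. The function name and the toy labels are illustrative only.

```python
from collections import Counter

def macro_f1_and_fpr(gold, predicted):
    """Return (macro-average F1, macro-average false positive rate)."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the true label but was missed
    n = len(gold)
    f1_scores, fp_rates = [], []
    for lab in labels:
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        f1_scores.append(2 * tp[lab] / denom if denom else 0.0)
        negatives = n - (tp[lab] + fn[lab])  # samples whose gold label is not lab
        fp_rates.append(fp[lab] / negatives if negatives else 0.0)
    return sum(f1_scores) / len(labels), sum(fp_rates) / len(labels)

# Toy data: four sentences, one English sentence misclassified as French.
gold = ["eng", "eng", "fra", "deu"]
predicted = ["eng", "fra", "fra", "deu"]
macro_f1, macro_fpr = macro_f1_and_fpr(gold, predicted)
```

Because every language contributes equally to a macro average, a low-resource language with poor accuracy lowers the score as much as a high-resource one would.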

Implementation Details

The model is trained with a softmax loss function, 256-dimensional embeddings, and character n-grams of length 2 to 5. Training runs for 2 epochs with a learning rate of 0.8 and a bucket size of 1,000,000.
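As a rough illustration, the settings in the paragraph above map onto standard fastText command-line flags. The sketch below only assembles the command; it does not run training. The file names `train.txt` and `openlid_sketch` are placeholders, and this is not necessarily the exact invocation the authors used.

```python
import shlex

# Hyperparameters as stated in the text; keys are fastText CLI flag names.
TRAIN_ARGS = {
    "loss": "softmax",
    "dim": 256,          # embedding dimension
    "minn": 2,           # shortest character n-gram
    "maxn": 5,           # longest character n-gram
    "epoch": 2,
    "lr": 0.8,
    "bucket": 1_000_000,
}

def build_train_command(train_file, model_prefix, args=TRAIN_ARGS):
    """Build a `fasttext supervised` command as a list of arguments."""
    cmd = ["fasttext", "supervised", "-input", train_file, "-output", model_prefix]
    for flag, value in args.items():
        cmd += [f"-{flag}", str(value)]
    return cmd

# Placeholder paths; substitute a real labelled training file.
command = build_train_command("train.txt", "openlid_sketch")
print(shlex.join(command))
```

The same names (`loss`, `dim`, `minn`, `maxn`, `epoch`, `lr`, `bucket`) are accepted as keyword arguments by `fasttext.train_supervised` in the Python package.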

  • Embedding dimension: 256
  • Character n-grams: 2-5
  • Word n-grams: 1
  • Minimum word occurrences: 1000
  • Temperature-based sampling for class balancing
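The last item, temperature-based sampling, can be sketched as follows. One common formulation raises each language's share of the data to the power 1/T and renormalises, so that larger temperatures flatten the distribution and give low-resource languages more weight. The temperature value, the language codes, and the counts below are hypothetical; the text does not state which values or exact formula were used for this model.

```python
def temperature_sampling_probs(counts, temperature):
    """Map raw per-language counts to sampling probabilities.

    temperature = 1 keeps the original proportions; larger values
    move the distribution towards uniform.
    """
    total = sum(counts.values())
    scaled = {
        lang: (n / total) ** (1.0 / temperature)
        for lang, n in counts.items()
    }
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}

# Hypothetical line counts for one large and one small language.
counts = {"eng_Latn": 1_000_000, "gla_Latn": 10_000}
original = temperature_sampling_probs(counts, temperature=1.0)
balanced = temperature_sampling_probs(counts, temperature=3.0)
```

With these toy counts the smaller language makes up about 1% of the data at temperature 1 and a considerably larger share at temperature 3, which is the balancing effect the technique is meant to provide.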

Core Capabilities

  • High-accuracy language identification across 189 languages
  • Clean text preprocessing functionality
  • Multiple language prediction capability (k-best outputs)
  • Robust performance on various text domains
  • Detailed confidence scores for predictions
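The k-best outputs and confidence scores listed above can be post-processed with a small helper like the one below. It assumes labels carry fastText's default `__label__` prefix and that language codes look like `eng_Latn`; both the helper name and the sample values are illustrative, and the values are a hand-written stand-in, not real model output.

```python
def format_predictions(labels, scores, prefix="__label__"):
    """Pair each predicted label with its score, stripping the label prefix."""
    pairs = []
    for label, score in zip(labels, scores):
        lang = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((lang, float(score)))
    # Highest-confidence prediction first.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Stand-in for the (labels, scores) pair a fastText model returns for k=3.
labels = ("__label__eng_Latn", "__label__sco_Latn", "__label__nld_Latn")
scores = (0.91, 0.06, 0.01)
top = format_predictions(labels, scores)
```

With the `fasttext` Python package, `model.predict(text, k=3)` returns a pair of labels and probabilities in this shape, so its result can be passed to the helper as `format_predictions(*model.predict(text, k=3))`.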

Frequently Asked Questions

Q: What makes this model unique?

OpenLID-v2 covers 189 languages while maintaining high accuracy. Its improvements over the predecessor and its carefully curated training data make it a reliable choice for real-world applications.

Q: What are the recommended use cases?

The model is ideal for language detection in multilingual content processing, content filtering, and as a preprocessing step in NLP pipelines. It's particularly useful when dealing with diverse language sets and when high accuracy is crucial.
