OpenLID-v2

laurievb

Language identification model that covers 189 languages with a macro-average F1 score of 0.93, built on the FastText architecture, with better accuracy and coverage than its predecessor.

Authors: Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, Kenneth Heafield
License: GPL-3.0
Model Type: Language Identification (Text Classification)
Language Coverage: 189 languages
Architecture: FastText

What is OpenLID-v2?

OpenLID-v2 is a language identification model that improves on its predecessor. It achieves a macro-average F1 score of 0.93 across 189 languages, with a false positive rate of 0.033%. The model uses the FastText architecture and is designed to identify a wide range of languages accurately and reliably.
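To make the two metrics above concrete, here is a minimal sketch of how a macro-average F1 score and a false positive rate can be computed from gold and predicted labels. The toy labels are invented for illustration and are not the model's evaluation data. Averaging the per-language false positive rate across languages is an assumption about how the single reported figure is aggregated.

```python
def per_class_counts(gold, pred):
    """Return {label: (tp, fp, fn, tn)} for every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    counts = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        tn = len(gold) - tp - fp - fn
        counts[label] = (tp, fp, fn, tn)
    return counts


def macro_f1(gold, pred):
    """Unweighted mean of per-language F1, so rare languages count as much as common ones."""
    scores = []
    for tp, fp, fn, _tn in per_class_counts(gold, pred).values():
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def macro_fpr(gold, pred):
    """Unweighted mean of per-language false positive rate, FP / (FP + TN)."""
    rates = []
    for _tp, fp, _fn, tn in per_class_counts(gold, pred).values():
        rates.append(fp / (fp + tn) if (fp + tn) else 0.0)
    return sum(rates) / len(rates)


# Toy data (hypothetical): one French sentence is misclassified as English.
toy_gold = ["eng_Latn", "eng_Latn", "fra_Latn", "fra_Latn", "deu_Latn", "deu_Latn"]
toy_pred = ["eng_Latn", "eng_Latn", "fra_Latn", "eng_Latn", "deu_Latn", "deu_Latn"]

print(f"macro F1:  {macro_f1(toy_gold, toy_pred):.4f}")
print(f"macro FPR: {macro_fpr(toy_gold, toy_pred):.4f}")
```

Because the average is taken per language rather than per sentence, a single misclassified low-resource language lowers the score as much as a misclassified high-resource one.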

Implementation Details

The model is trained with a softmax loss function, 256-dimensional embeddings, and character n-grams of length 2 to 5. It was trained for 2 epochs with a learning rate of 0.8 and uses a hash bucket size of 1,000,000.

  • Embedding dimension: 256
  • Character n-grams: 2-5
  • Word n-grams: 1
  • Minimum word occurrences: 1000
  • Temperature-based sampling for class balancing
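The settings above map onto arguments of the fastText Python API roughly as sketched below. The dictionary name, the `train` helper, and the training file path are hypothetical. The temperature-sampling function uses the common formulation p_i ∝ (n_i / N)^(1/T); the exact formula and temperature used for OpenLID-v2 are not stated here, so treat both as assumptions.

```python
# Hyperparameters as listed above, keyed by fastText's train_supervised argument names.
# The dictionary name is illustrative only.
OPENLID_V2_PARAMS = {
    "loss": "softmax",
    "dim": 256,           # embedding dimension
    "minn": 2,            # shortest character n-gram
    "maxn": 5,            # longest character n-gram
    "wordNgrams": 1,      # word n-grams
    "minCount": 1000,     # minimum word occurrences
    "epoch": 2,
    "lr": 0.8,
    "bucket": 1_000_000,  # hash buckets for n-gram features
}


def train(train_path, params=OPENLID_V2_PARAMS):
    """Train a supervised fastText classifier on a file of '__label__<lang> <text>' lines.

    Hypothetical helper: requires the `fasttext` package and a prepared training file.
    """
    import fasttext  # imported lazily so the rest of this sketch runs without it

    return fasttext.train_supervised(input=train_path, **params)


def temperature_sampling_probs(class_sizes, temperature):
    """Return per-language sampling probabilities p_i proportional to (n_i / N) ** (1 / T).

    T = 1 keeps the natural distribution; larger T flattens it, which
    upsamples languages with little data. Assumed formulation, see lead-in.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(class_sizes.values())
    weights = {
        label: (n / total) ** (1.0 / temperature) for label, n in class_sizes.items()
    }
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


# Toy example (hypothetical counts): a 9:1 imbalance becomes 3:1 at T = 2.
print(temperature_sampling_probs({"eng_Latn": 900, "gla_Latn": 100}, temperature=2.0))
```

Sampling training lines with these probabilities, rather than in proportion to raw counts, is one way to keep high-resource languages from dominating the classifier.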

Core Capabilities

  • High-accuracy language identification across 189 languages
  • Text cleaning as a preprocessing step
  • Top-k language predictions (k-best outputs)
  • Robust performance on various text domains
  • Confidence score for each prediction
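As a rough illustration of k-best outputs with confidence scores, the sketch below loads a FastText model and returns the top-k languages for a piece of text. The helper names and the local `model_path` are hypothetical, and the `__label__` prefix is fastText's default label format, assumed here to apply to this model. The whitespace cleanup is only a stand-in for the model's own preprocessing.

```python
def parse_predictions(labels, probs, prefix="__label__"):
    """Convert fastText's (labels, probabilities) pair into [(language_code, score), ...]."""
    parsed = []
    for label, prob in zip(labels, probs):
        code = label[len(prefix):] if label.startswith(prefix) else label
        parsed.append((code, float(prob)))
    return parsed


def predict_languages(model_path, text, k=3):
    """Return the k most likely languages for `text`, best first.

    Hypothetical helper: requires the `fasttext` package and a downloaded model file.
    """
    import fasttext  # imported lazily so parse_predictions works without it

    model = fasttext.load_model(model_path)
    # fastText's predict() rejects newlines, so collapse all whitespace first.
    cleaned = " ".join(text.split())
    labels, probs = model.predict(cleaned, k=k)
    return parse_predictions(labels, probs)


# Parsing works on any output of the same shape, for example a made-up result:
print(parse_predictions(("__label__eng_Latn", "__label__sco_Latn"), [0.91, 0.05]))
```

In a real pipeline the model would be loaded once and reused across calls; it is loaded inside the function here only to keep the sketch short.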

Frequently Asked Questions

Q: What makes this model unique?

OpenLID-v2 stands out for covering 189 languages while maintaining high accuracy. Improvements over its predecessor and carefully curated training data make it reliable for real-world applications.

Q: What are the recommended use cases?

The model is ideal for language detection in multilingual content processing, content filtering, and as a preprocessing step in NLP pipelines. It's particularly useful when dealing with diverse language sets and when high accuracy is crucial.
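A content-filtering step of the kind described above might look like the following sketch. The function name, the `predict_top1` callable, and the 0.5 confidence threshold are all assumptions for illustration. In practice `predict_top1` would wrap a call to the language identification model and return its best label and score.

```python
def filter_by_language(docs, predict_top1, target_lang, min_confidence=0.5):
    """Keep documents whose top predicted language is `target_lang` with enough confidence.

    `predict_top1` is any callable mapping text to a (language_code, score) pair.
    """
    kept = []
    for doc in docs:
        lang, score = predict_top1(doc)
        if lang == target_lang and score >= min_confidence:
            kept.append(doc)
    return kept


# Stand-in predictor with made-up scores, used here instead of a real model.
fake_scores = {
    "hello world": ("eng_Latn", 0.90),
    "bonjour": ("fra_Latn", 0.95),
    "ok": ("eng_Latn", 0.40),
}
print(filter_by_language(list(fake_scores), fake_scores.get, "eng_Latn"))
```

Passing the predictor in as a callable keeps the filter independent of how the model is loaded, and the threshold should be tuned on held-out data, since very short texts tend to receive low scores.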
