berteus-base-cased

Maintained By
ixa-ehu

BERTeus-base-cased

PropertyValue
Authorixa-ehu
Training Data224.6M tokens (189.6M news + 35M Wikipedia)
PaperGive your Text Representation Models some Love: the Case for Basque

What is berteus-base-cased?

BERTeus is a groundbreaking BERT-based model specifically designed for the Basque language. It represents a significant advancement in Basque natural language processing, trained on a comprehensive corpus of 224.6 million tokens derived from Basque news articles and Wikipedia content. This model has established new state-of-the-art benchmarks across multiple critical NLP tasks.

Implementation Details

The model utilizes the BERT architecture while being specifically optimized for Basque language processing. It has been trained on a carefully curated dataset comprising Basque crawled news articles and Wikipedia content, with the latter contributing 35 million tokens to the training corpus.

  • Cased model maintaining original text capitalization
  • Based on the BERT architecture
  • Trained on a diverse corpus of Basque text

Core Capabilities

  • Part-of-Speech (POS) Tagging: 97.76% accuracy (surpassing mBERT's 96.37%)
  • Named Entity Recognition (NER): 87.06% accuracy (beating previous SOTA of 76.72%)
  • Sentiment Analysis: 78.10% accuracy (improving upon 74.02%)
  • Topic Classification: 76.77% accuracy (significant improvement over 63.00%)

Frequently Asked Questions

Q: What makes this model unique?

BERTeus is the first BERT model specifically optimized for the Basque language, achieving superior performance compared to multilingual BERT (mBERT) across all tested tasks. It demonstrates significant improvements in critical NLP tasks, particularly showing a remarkable 10.34% improvement in NER over previous state-of-the-art systems.

Q: What are the recommended use cases?

The model is particularly well-suited for Basque language processing tasks including POS tagging, NER, sentiment analysis, and topic classification. It's recommended for researchers and developers working with Basque language content who require high-accuracy natural language processing capabilities.

🍰 Interesting in building your own agents?
PromptLayer provides Huggingface integration tools to manage and monitor prompts with your whole team. Get started here.