bcms-bertic

classla

BERTić is a transformer language model for Bosnian, Croatian, Montenegrin and Serbian. It was trained on more than 8 billion tokens and outperforms multilingual BERT (mBERT).

Property	Value
License	Apache 2.0
Languages	Bosnian, Croatian, Montenegrin, Serbian
Architecture	ELECTRA-based Transformer

What is bcms-bertic?

BERTić is a transformer language model built specifically for Bosnian, Croatian, Montenegrin and Serbian (BCMS). It was trained on more than 8 billion tokens of text in these South Slavic languages. The name plays on the "-ić" suffix, which is common in Croatian diminutives and in surnames across the region.

Implementation Details

Built on the ELECTRA architecture, BERTić outperforms multilingual BERT and CroSloEngual BERT across multiple NLP tasks. It has been evaluated on part-of-speech tagging, named entity recognition, geolocation prediction, and commonsense causal reasoning, with the following reported results:

Up to 95.81% accuracy on Croatian POS tagging
89.21% F1-score on Croatian NER
Median distance error of 37.96 km in geolocation prediction (lower is better)
65.76% accuracy on the COPA dataset for causal reasoning
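
The geolocation figure above is a median of per-example distances between gold and predicted coordinates. How those distances are computed is not stated here, so the following is only a minimal sketch that assumes great-circle (haversine) distance in kilometres. The function names and the sample coordinates are hypothetical and are not taken from the BERTić evaluation.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_distance_error(gold, predicted):
    """Median of the distances between paired gold and predicted (lat, lon) tuples."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    distances = [haversine_km(*g, *p) for g, p in zip(gold, predicted)]
    return median(distances)


# Hypothetical coordinates, for illustration only.
zagreb = (45.815, 15.982)
belgrade = (44.787, 20.457)
sarajevo = (43.856, 18.413)

gold = [zagreb, belgrade, sarajevo]
predicted = [zagreb, sarajevo, sarajevo]  # one of the three predictions is wrong
print(median_distance_error(gold, predicted))
```

Because the median ignores the size of the worst errors, a model can score well on it while still making a few very distant predictions, which is why the median is commonly reported alongside the mean.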

Core Capabilities

Part-of-speech tagging for standard and non-standard language varieties
Named entity recognition across multiple language variants
Geolocation prediction from social media text
Commonsense causal reasoning
Support for both formal and internet-based language varieties

Frequently Asked Questions

Q: What makes this model unique?

BERTić is the first transformer model built specifically for Bosnian, Croatian, Montenegrin and Serbian. It outperforms multilingual alternatives such as multilingual BERT and CroSloEngual BERT on the evaluated NLP tasks, and its training on more than 8 billion tokens makes it particularly robust for these languages.

Q: What are the recommended use cases?

The model is suited to processing standard and non-standard BCMS text, including POS tagging, NER, text classification, and semantic analysis. It is particularly useful where an application must handle both formal and informal language varieties.
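
As a starting point for token-level use cases such as POS tagging or NER, the sketch below loads the model with a new classification head through the Hugging Face `transformers` library. It rests on several assumptions: that the checkpoint is published on the Hugging Face Hub under the ID `classla/bcms-bertic` (built here from the author and model names at the top of this page), that `transformers` and PyTorch are installed, and that the label count of 9 is a placeholder for whatever tag set is actually used. The helper names are hypothetical.

```python
# Assumed Hub ID, combining the author ("classla") and model name ("bcms-bertic").
MODEL_ID = "classla/bcms-bertic"


def load_token_classifier(num_labels):
    """Load the tokenizer and the encoder with a freshly initialised tagging head."""
    # Imported lazily so that defining this function does not require transformers.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels
    )
    return tokenizer, model


def demo():
    """Run one sentence through the model; downloads the checkpoint on first call."""
    tokenizer, model = load_token_classifier(num_labels=9)  # placeholder label count
    encoded = tokenizer("Zagreb je glavni grad Hrvatske.", return_tensors="pt")
    logits = model(**encoded).logits
    # Shape is (batch, tokens, num_labels). The head is untrained at this point,
    # so the scores are meaningless until the model is fine-tuned on labelled data.
    print(logits.shape)
```

Calling `demo()` only confirms that the pieces fit together. Useful predictions require fine-tuning on a labelled BCMS dataset for the task at hand.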