huBERT-base-cc
| Property | Value |
|---|---|
| Author | SZTAKI-HLT |
| Model Type | BERT (Cased) |
| Language | Hungarian |
| Training Data | Hungarian Common Crawl + Wikipedia |
| Best NER Score | 97.62% |
What is huBERT-base-cc?
huBERT-base-cc is a cased BERT model built specifically for Hungarian. Developed by SZTAKI-HLT, it was trained on a filtered and deduplicated corpus that combines the Hungarian segment of Common Crawl with the Hungarian Wikipedia.
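Because huBERT exposes the standard BERT interface, it can be loaded through the Hugging Face transformers library. The sketch below assumes the Hub ID `SZTAKI-HLT/hubert-base-cc`, derived from the author and model names above; verify the exact ID on the Hub before use.

```python
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "SZTAKI-HLT/hubert-base-cc"  # assumed Hub ID based on the names above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a Hungarian sentence ("Budapest is the capital of Hungary.")
inputs = tokenizer("Budapest Magyarország fővárosa.", return_tensors="pt")
outputs = model(**inputs)

# A BERT-base encoder produces 768-dimensional token embeddings
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```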
Implementation Details
The model follows the standard BERT base architecture and uses a cased vocabulary built for Hungarian. Its strongest reported results are on token classification tasks:
- Achieves state-of-the-art results on Hungarian NER (97.62%)
- Excellent performance on chunking tasks (Minimal NP: 97.14%, Maximal NP: 96.97%)
- Outperforms multilingual BERT on Hungarian tasks
Core Capabilities
- Named Entity Recognition
- Chunking (both minimal and maximal NP)
- General Hungarian language understanding
- Token classification tasks
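Because the checkpoint is a standard BERT model pretrained with a masked-language-modelling objective, its general Hungarian understanding can be probed directly with a fill-mask pipeline. This is a minimal sketch, assuming the Hub ID used above and that the checkpoint ships with its MLM head, as standard BERT checkpoints do:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="SZTAKI-HLT/hubert-base-cc")

# "Budapest is Hungary's [MASK]." - a well-trained Hungarian model
# should rank "fővárosa" ("capital") among the top predictions.
for pred in fill_mask("Budapest Magyarország [MASK]."):
    print(f"{pred['token_str']}  (score: {pred['score']:.3f})")
```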
Frequently Asked Questions
Q: What makes this model unique?
The model is optimized specifically for Hungarian and outperforms multilingual alternatives such as multilingual BERT on Hungarian benchmarks. It was trained on a large, deduplicated Hungarian corpus and has achieved state-of-the-art results on several benchmarks, including NER and NP chunking.
Q: What are the recommended use cases?
The model is particularly well suited to token classification tasks, especially Named Entity Recognition and text chunking in Hungarian. It can be used like any other cased BERT model, but delivers its best performance on Hungarian text; a minimal fine-tuning starting point is sketched below.
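For NER and other token classification tasks, the usual pattern is to attach a token classification head to the pretrained encoder and fine-tune it on labelled data. A minimal sketch follows, assuming the same Hub ID and a hypothetical IOB2 tag set; note that the 97.62% NER score reported above comes from fine-tuned models, not from this freshly initialized head.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_ID = "SZTAKI-HLT/hubert-base-cc"

# Hypothetical IOB2 tag set; the actual labels depend on your training data.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Attaches a randomly initialized classification head on top of the
# pretrained encoder; it must be fine-tuned on labelled NER data
# before its predictions are meaningful.
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID, num_labels=len(labels)
)

# "János Kovács works in Budapest."
inputs = tokenizer("Kovács János Budapesten dolgozik.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(logits.shape)  # torch.Size([1, seq_len, num_labels])
```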