hubert-base-cc


  • Author: SZTAKI-HLT
  • Model Type: BERT (Cased)
  • Language: Hungarian
  • Training Data: Hungarian Common Crawl + Wikipedia
  • Best NER Score: 97.62%

What is hubert-base-cc?

huBERT-base-cc is a monolingual BERT model for Hungarian. Developed by SZTAKI-HLT, this cased model was trained on a curated dataset of filtered and deduplicated Hungarian text from Common Crawl and Wikipedia, and it marks a significant step forward for Hungarian NLP.

Implementation Details

The model follows the BERT base architecture and is trained as a monolingual Hungarian model rather than a multilingual one. On standard Hungarian benchmarks it performs especially well on token classification tasks:

  • Achieves state-of-the-art results on Hungarian NER (97.62%)
  • Excellent performance on chunking tasks (Minimal NP: 97.14%, Maximal NP: 96.97%)
  • Outperforms multilingual BERT on Hungarian tasks
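
Because huBERT follows the standard BERT base architecture, it loads with the usual Hugging Face transformers classes. The sketch below is a minimal example assuming the commonly used Hub ID SZTAKI-HLT/hubert-base-cc; it runs masked-token prediction on a short Hungarian sentence.

```python
# Minimal usage sketch, assuming the Hub ID "SZTAKI-HLT/hubert-base-cc".
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "SZTAKI-HLT/hubert-base-cc"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Predict the masked word in "Budapest Magyarország [MASK]."
# ("Budapest is Hungary's [MASK]." -- a plausible top prediction is "fővárosa", "capital".)
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask("Budapest Magyarország [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```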

Core Capabilities

  • Named Entity Recognition
  • Chunking (both minimal and maximal NP)
  • General Hungarian language understanding
  • Token classification tasks

Frequently Asked Questions

Q: What makes this model unique?

This model is specifically optimized for Hungarian language processing, offering superior performance compared to multilingual alternatives. It's trained on a comprehensive Hungarian dataset and has achieved state-of-the-art results on multiple benchmarks.

Q: What are the recommended use cases?

The model is particularly well suited to token classification tasks, especially Named Entity Recognition and text chunking in Hungarian. It can be used like any other cased BERT model; the difference is that it is pretrained on Hungarian text, so it performs best on Hungarian-language content. A fine-tuning sketch follows below.
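
As a rough illustration, the sketch below sets up huBERT for token classification (e.g. NER) fine-tuning. The Hub ID and the label set are assumptions for illustration only; the base checkpoint ships without a task head, so the classification layer starts untrained and needs labelled Hungarian data.

```python
# Fine-tuning setup sketch for Hungarian NER on top of huBERT.
# The Hub ID and label set are illustrative assumptions, not part of the model card.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "SZTAKI-HLT/hubert-base-cc"
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# The classification head above is randomly initialized; train it on labelled
# Hungarian NER data (e.g. with the Trainer API) before using it for inference.
encoding = tokenizer("Kovács János Budapesten dolgozik.", return_tensors="pt")
outputs = model(**encoding)
print(outputs.logits.shape)  # (batch_size, sequence_length, num_labels)
```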
