nerkor-cars-onpp-hubert

novakat

Hungarian named entity recognition model built on HuBERT, supporting 30+ entity types including OntoNotes 5.0 categories and custom tags for vehicles and media

Property	Value
Base Model	SZTAKI-HLT/hubert-base-cc
Max Sequence Length	448 tokens
Paper	NerKor+Cars-OntoNotes++
Author	novakat

What is nerkor-cars-onpp-hubert?

nerkor-cars-onpp-hubert is a Hungarian named entity recognition model that builds on the NYTK-NerKor corpus and recognizes more than 30 entity types. It is based on the HuBERT architecture and has been fine-tuned on an extended dataset annotated with both the OntoNotes 5.0 categories and additional custom entity types added for Hungarian text.

Implementation Details

The model is trained on approximately 1 million tokens from NYTK-NerKor, supplemented with 12,000 tokens of automotive content from hvg.hu. Its tag set goes beyond the four categories of the traditional CoNLL-2002 scheme (PER, ORG, LOC, MISC).

Built on the SZTAKI-HLT/hubert-base-cc pretrained model
Supports sequence lengths up to 448 tokens
Incorporates all OntoNotes 5.0 entity types plus custom extensions
Features specialized automotive and media entity recognition capabilities
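
Because the model accepts at most 448 tokens, longer documents have to be split before inference. The following is a minimal sketch of one common way to do that, using overlapping windows over a list of token ids. The function name, the 64-token overlap, and the two reserved positions for special tokens are assumptions made for illustration; the source does not describe how long inputs should be handled.

```python
from typing import List

MAX_LEN = 448  # maximum sequence length stated in the model card


def split_into_windows(token_ids: List[int], max_len: int = MAX_LEN,
                       stride: int = 64, num_special: int = 2) -> List[List[int]]:
    """Split token ids into overlapping windows that fit the length limit.

    num_special reserves room for special tokens such as [CLS] and [SEP];
    stride is the overlap between consecutive windows (hypothetical default).
    """
    body = max_len - num_special  # usable positions per window
    if body <= stride:
        raise ValueError("max_len must leave more room than the stride")
    if len(token_ids) <= body:
        return [list(token_ids)]

    windows = []
    step = body - stride  # how far each new window advances
    start = 0
    while True:
        windows.append(list(token_ids[start:start + body]))
        if start + body >= len(token_ids):
            break  # the last window reached the end of the input
        start += step
    return windows
```

With the defaults, a 1,000-token input yields three windows of at most 446 tokens each. Hugging Face tokenizers offer comparable behaviour through the `return_overflowing_tokens` and `stride` arguments, which may be preferable when working directly with the tokenizer.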

Core Capabilities

Recognizes standard entities (PER, ORG, LOC, GPE, etc.)
Handles temporal expressions (DATE, TIME, DUR)
Processes numerical entities (PERCENT, MONEY, QUANTITY)
Identifies specialized categories (CAR, MEDIA, SMEDIA)
Supports extended miscellaneous categories (AWARD, PROJ)
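
To make the categories above concrete, here is a small sketch that merges per-token tags into entity spans. It assumes the model emits BIO-style labels such as `B-CAR` and `I-CAR`; that tagging scheme, the helper name `group_bio`, and the hand-written example tags are assumptions for illustration, not actual model output.

```python
from typing import List, Optional, Tuple


def group_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A B- tag (or a stray I- tag of a different type) opens a new entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the open entity
        else:
            # "O" closes whatever entity was open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hand-written tags for "János Kovács bought a Suzuki Vitara car yesterday."
tokens = ["Kovács", "János", "tegnap", "vett", "egy", "Suzuki", "Vitara", "autót", "."]
tags = ["B-PER", "I-PER", "B-DATE", "O", "O", "B-CAR", "I-CAR", "O", "O"]
print(group_bio(tokens, tags))
# [('Kovács János', 'PER'), ('tegnap', 'DATE'), ('Suzuki Vitara', 'CAR')]
```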

Frequently Asked Questions

Q: What makes this model unique?

The model combines the standard OntoNotes 5.0 categories with custom entity types, covering more than 30 types in total. In addition to the standard categories, it recognizes automotive entities (CAR) and several media categories (MEDIA, SMEDIA), which makes it applicable to a wide range of Hungarian NLP tasks.

Q: What are the recommended use cases?

The model is suited to Hungarian text analysis tasks that require fine-grained entity classification. Typical inputs are news content, automotive-related texts, and general Hungarian-language documents.
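
For a use case such as pulling car mentions out of news text, a sketch along the following lines may help. It assumes the model is published on the Hugging Face Hub under the id `novakat/nerkor-cars-onpp-hubert` (formed from the author and model name above) and that it is loaded through the `transformers` token-classification pipeline. The helper names and the 0.5 score threshold are hypothetical.

```python
from typing import Dict, List, Set

# Assumed Hub id, formed from the author and model name in the card.
MODEL_ID = "novakat/nerkor-cars-onpp-hubert"


def load_ner():
    """Build a token-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # lazy import; requires `transformers`
    return pipeline("token-classification", model=MODEL_ID,
                    aggregation_strategy="simple")


def keep_types(entities: List[Dict], wanted: Set[str],
               min_score: float = 0.5) -> List[Dict]:
    """Keep aggregated entities of the wanted types above a score threshold.

    `entities` is expected in the shape returned by an aggregated pipeline:
    dicts with "entity_group", "score", "word", "start", and "end" keys.
    """
    return [e for e in entities
            if e["entity_group"] in wanted and e["score"] >= min_score]
```

A call such as `keep_types(load_ner()(text), {"CAR", "ORG"})` would then return only the car and organization mentions found in `text`. Inputs longer than the 448-token limit need to be split beforehand.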