postagger-portuguese
Property | Value |
---|---|
Author | lisaterumi |
F1-Score | 0.9826 |
Base Model | BERTimbau |
Training Data | MacMorpho corpus |
Paper DOI | 10.59681/2175-4411.v15.iEspecial.2023.1086 |
What is postagger-portuguese?
postagger-portuguese is a Part-of-Speech (POS) tagger for the Portuguese language. It was built by fine-tuning the BERTimbau model on the MacMorpho corpus and reaches a 98.26% F1-score when assigning one of 27 grammatical categories to each token of Portuguese text.
Implementation Details
The model was trained for up to 30 epochs with a batch size of 32 and a learning rate of 1e-5. Training stops early after 3 consecutive epochs without improvement. Input sequences are limited to 200 tokens. The model uses BERTimbau as its base and is fine-tuned for POS tagging.
- 27 distinct POS tag classes
- 98.26% F1-score on the evaluation set
- Optimized for clinical and general Portuguese text
- Comprehensive tag set including specialized categories like ADV-KS-REL and PRO-KS
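The training schedule described above (at most 30 epochs, stopping after 3 epochs without improvement) can be sketched in plain Python. This is a minimal illustration, not the author's training script: `run_epoch` is a hypothetical callback that trains for one epoch and returns a validation score, and the card does not say which metric was monitored, so the sketch assumes a score where higher is better.

```python
from typing import Callable, Tuple

# Hyperparameters as stated in the model card.
MAX_EPOCHS = 30
BATCH_SIZE = 32
LEARNING_RATE = 1e-5
MAX_SEQ_LEN = 200
PATIENCE = 3  # epochs without improvement before stopping


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    max_epochs: int = MAX_EPOCHS,
    patience: int = PATIENCE,
) -> Tuple[int, float]:
    """Run epochs until the validation score stalls for `patience` epochs.

    `run_epoch` (hypothetical) trains one epoch and returns a validation
    score, assumed here to be "higher is better".
    Returns (best_epoch, best_score) with 1-indexed epochs.
    """
    best_score = float("-inf")
    best_epoch = 0
    stale_epochs = 0
    for epoch in range(1, max_epochs + 1):
        score = run_epoch(epoch)
        if score > best_score:
            # Strict improvement resets the patience counter.
            best_score, best_epoch, stale_epochs = score, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    return best_epoch, best_score
```

A tie with the best score counts as "no improvement" in this sketch; a real training loop might apply a minimum-delta threshold instead.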
Core Capabilities
- Advanced morphological analysis of Portuguese text
- Identification of complex grammatical structures
- Support for both clinical and general domain text
- High-precision tagging of pronouns, verbs, and specialized linguistic elements
Frequently Asked Questions
Q: What makes this model unique?
The model's 98.26% F1-score is reported as state-of-the-art on the MacMorpho corpus, and it handles both clinical and general text. This makes it a strong choice for Portuguese NLP applications that depend on accurate POS tags.
Q: What are the recommended use cases?
The model is particularly well-suited for processing Electronic Health Records and clinical narratives in Portuguese, achieving 81.45% accuracy on clinical texts compared to 76.56% for generic models. It's also effective for general linguistic analysis, academic research, and any NLP pipeline requiring accurate Portuguese part-of-speech tagging.
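To use the tagger in an NLP pipeline, one option is the Hugging Face `transformers` token-classification pipeline. The sketch below assumes the model is published on the Hub as `lisaterumi/postagger-portuguese` (inferred from the author and model name above, not confirmed by this card). BERT-style models split words into WordPiece fragments, so the helper `merge_wordpieces` (a hypothetical name) joins `##` continuations back onto their word and keeps the tag of the first piece, which is one common convention rather than necessarily the author's.

```python
from typing import Dict, List, Tuple

# Assumed Hub id, built from the author and model name in this card.
MODEL_ID = "lisaterumi/postagger-portuguese"


def merge_wordpieces(predictions: List[Dict]) -> List[Tuple[str, str]]:
    """Join WordPiece continuations ("##...") onto the preceding word.

    Each prediction is a dict with at least "word" and "entity" keys, as
    returned by a token-classification pipeline without aggregation.
    The tag of the first piece is kept for the whole word.
    """
    words: List[Tuple[str, str]] = []
    for pred in predictions:
        piece, tag = pred["word"], pred["entity"]
        if piece.startswith("##") and words:
            prev_word, prev_tag = words[-1]
            words[-1] = (prev_word + piece[2:], prev_tag)
        else:
            words.append((piece, tag))
    return words


def load_tagger():
    """Build the pipeline; downloads the model on first use."""
    # Imported lazily so merge_wordpieces has no third-party dependency.
    from transformers import pipeline

    return pipeline("token-classification", model=MODEL_ID)
```

With `transformers` installed and network access, `merge_wordpieces(load_tagger()("O paciente apresenta febre alta."))` would return a list of `(word, tag)` pairs. The pipeline's built-in aggregation is avoided here because it groups adjacent tokens that share a label, which for POS tags can merge neighbouring words of the same class.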