punctuate-all
| Property | Value |
|---|---|
| License | MIT |
| Base Architecture | XLM-RoBERTa-base |
| Task | Token Classification |
| Dataset | WMT/Europarl |
What is punctuate-all?
punctuate-all is a multilingual punctuation restoration model that builds on Oliver Guhr's work. It is a fine-tuned XLM-RoBERTa-base model that restores punctuation in twelve languages: English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, and Slovenian.
Implementation Details
The model reports an overall accuracy of 98% across all punctuation classes. It performs best on period and comma detection, with F1-scores of 0.95 and 0.86 respectively. The model handles five punctuation marks: period, comma, question mark, hyphen, and colon, with precision and recall varying by mark.
- Period detection: 94% precision, 95% recall
- Comma detection: 86% precision, 86% recall
- Question mark detection: 88% precision, 85% recall
- Built on the PyTorch framework with a Transformer architecture
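The F1-scores quoted above are the harmonic mean of precision and recall. The following minimal sketch recomputes them from the rounded figures in the list. The function name `f1_score` and the `reported` dictionary are illustrative, not part of the model's code, and because the inputs are already rounded, the results can differ from the published F1 values in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall pairs as listed above (illustrative container).
reported = {
    "period": (0.94, 0.95),
    "comma": (0.86, 0.86),
    "question mark": (0.88, 0.85),
}

for mark, (precision, recall) in reported.items():
    print(f"{mark}: F1 ~ {f1_score(precision, recall):.3f}")
```

For the period class, the rounded inputs give roughly 0.945, which is consistent with the reported 0.95 once rounding of the underlying values is taken into account.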
Core Capabilities
- Multilingual support for 12 European languages
- High-accuracy punctuation restoration (98% overall accuracy)
- Lower compute cost from using the base-sized architecture rather than a large one
- Restoration of five punctuation marks: period, comma, question mark, hyphen, and colon
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for covering 12 languages with a smaller base model than the original implementation, which used a large model, while reporting comparable performance metrics.
Q: What are the recommended use cases?
The model is ideal for automated transcription post-processing, text normalization tasks, and any NLP pipeline requiring punctuation restoration across multiple European languages. It's particularly effective for period and comma restoration, making it suitable for processing raw text from speech recognition systems.
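As a rough illustration of the post-processing step, the sketch below applies per-word punctuation labels to unpunctuated text. It assumes a label scheme in which each word receives either `"0"` (no punctuation) or one of the five marks, and that the mark is simply appended to the word. The helper `apply_punctuation` is hypothetical, and in practice the labels would come from the token classification model rather than being written by hand.

```python
from typing import List

# Assumed label set: one of these marks follows the word; "0" means none.
PUNCTUATION_LABELS = {".", ",", "?", "-", ":"}


def apply_punctuation(words: List[str], labels: List[str]) -> str:
    """Append each predicted mark to its word and join the words with spaces."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    punctuated = []
    for word, label in zip(words, labels):
        # Any label outside the known marks (such as "0") leaves the word as-is.
        punctuated.append(word + label if label in PUNCTUATION_LABELS else word)
    return " ".join(punctuated)


# Hand-written labels standing in for model predictions.
words = "my name is clara and i live in berkeley california".split()
labels = ["0", "0", "0", "0", "0", "0", "0", "0", ",", "."]
print(apply_punctuation(words, labels))
```

Keeping this step separate from inference makes it easy to test the text reconstruction independently of the model.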