punct_cap_seg_47_language

Maintained By
1-800-BAD-CODE

Property | Value
License | Apache 2.0
Architecture | 6-layer Transformer (512 dim)
Languages | 47 languages
Training Data | WMT News Crawl (1M lines per language)

What is punct_cap_seg_47_language?

punct_cap_seg_47_language is a multilingual model that takes unpunctuated, lowercase text and restores punctuation, capitalization, and sentence boundaries. It supports 47 languages written in scripts such as Latin, Chinese, Arabic, and Ethiopic, and handles each language's punctuation conventions within a single, unified architecture.

Implementation Details

The model tokenizes its input with a SentencePiece tokenizer (64k vocabulary) and encodes it with a 6-layer Transformer encoder. It then makes its predictions in four stages: post-punctuation prediction, pre-punctuation prediction, sentence boundary detection, and true-casing. Each stage builds on the outputs of the stages before it.
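
To make the staged output concrete, the sketch below shows one way the per-token decisions from the four stages could be combined back into formatted sentences. It is a toy illustration, not the model's actual decoding code: the `TokenPrediction` class, its field names, and the `render` function are hypothetical, the predictions are written by hand rather than produced by the model, and tokens are joined with spaces, which would not suit languages such as Chinese.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TokenPrediction:
    """Hypothetical container for the decisions made about one token."""
    token: str                          # lowercase, unpunctuated input token
    post_punct: Optional[str] = None    # mark placed after the token, e.g. "." or "?"
    pre_punct: Optional[str] = None     # mark placed before the token, e.g. "¿"
    ends_sentence: bool = False         # True if a sentence boundary follows
    upper: Tuple[int, ...] = ()         # character positions to upper-case


def render(preds: List[TokenPrediction]) -> List[str]:
    """Combine per-token decisions into a list of formatted sentences."""
    sentences: List[str] = []
    current: List[str] = []
    for p in preds:
        # True-casing: upper-case only the selected character positions,
        # which covers both "Hola" (first letter) and "NATO" (all letters).
        chars = [c.upper() if i in p.upper else c for i, c in enumerate(p.token)]
        word = (p.pre_punct or "") + "".join(chars) + (p.post_punct or "")
        current.append(word)
        if p.ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing tokens with no predicted boundary
        sentences.append(" ".join(current))
    return sentences


# Hand-written predictions for the input "hola amigo cómo estás"
example = [
    TokenPrediction("hola", upper=(0,)),
    TokenPrediction("amigo", post_punct=".", ends_sentence=True),
    TokenPrediction("cómo", pre_punct="¿", upper=(0,)),
    TokenPrediction("estás", post_punct="?", ends_sentence=True),
]
print(render(example))
```

Keeping the character positions to upper-case as a tuple, rather than a single "capitalize" flag, is a design choice made here so the same structure can express acronyms as well as sentence-initial capitals.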

  • Handles multiple punctuation types, including language-specific marks such as ?,。 (Chinese/Japanese) and ។, ៕ (Khmer)
  • Maximum sequence length of 128 tokens
  • Trained on news data with 1M lines per language
  • Supports true-casing for various scenarios including acronyms and proper nouns
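
Because the maximum sequence length is 128 tokens, longer inputs have to be split before they reach the model. The following is a minimal sketch of one common approach, overlapping windows. It rests on several assumptions: the `split_into_windows` helper and the overlap of 16 are hypothetical choices, and the example counts whitespace-separated words, whereas the real limit applies to SentencePiece tokens, so actual counts will differ. How predictions from overlapping windows are merged is left out.

```python
from typing import List

MAX_TOKENS = 128  # maximum sequence length stated for the model


def split_into_windows(
    tokens: List[str],
    max_len: int = MAX_TOKENS,
    overlap: int = 16,  # assumed value, chosen only for illustration
) -> List[List[str]]:
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that tokens near a
    window edge also appear with more context in a neighbouring window.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Stand-in tokens; a real pipeline would use the tokenizer's output instead.
tokens = [str(i) for i in range(300)]
windows = split_into_windows(tokens)
print([len(w) for w in windows])
```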

Core Capabilities

  • Punctuation restoration across 47 languages
  • Language-specific punctuation handling (e.g., inverted question marks for Spanish)
  • Sentence boundary detection
  • Capitalization (true-casing) of sentence starts, proper nouns, and acronyms

Frequently Asked Questions

Q: What makes this model unique?

A single architecture handles all 47 languages without a separate language-identification step. The same model processes different scripts and applies each language's punctuation rules.

Q: What are the recommended use cases?

The model is particularly suited to formatting automatic speech recognition (ASR) output, or any other lowercase, unpunctuated text. Because it was trained on news data, it works best on formal, news-like text rather than conversational text.
