keyphrase-extraction-kbir-openkp
| Property | Value |
|---|---|
| Model Type | Token Classification |
| Training Dataset | OpenKP (148,124 web documents) |
| Learning Rate | 1e-4 |
| Training Epochs | 50 |
| Early Stopping | 3 epochs patience |
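The training settings in the table can be read as a loop that runs for at most 50 epochs and stops once the monitored metric has failed to improve for 3 consecutive epochs. The sketch below illustrates that patience logic in plain Python. The choice of validation loss as the monitored metric, the function name `run_with_early_stopping`, and the sample loss values are assumptions for illustration, not details taken from the model's training code.

```python
def run_with_early_stopping(val_losses, max_epochs=50, patience=3):
    """Simulate patience-based early stopping over a sequence of losses.

    val_losses stands in for real training: one validation loss per epoch.
    Returns (epochs_run, best_loss).
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if epoch > max_epochs:
            break  # hard cap on the number of epochs
        epochs_run = epoch

        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row

    return epochs_run, best_loss


# Hypothetical loss curve: improves for 3 epochs, then stalls.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
print(run_with_early_stopping(losses))
```

With this curve the loop stops after epoch 6, so the later improvement at epoch 7 is never seen. That trade-off is inherent to a small patience value.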
What is keyphrase-extraction-kbir-openkp?
This is a transformer model for automatic keyphrase extraction from text documents. It builds on KBIR (Keyphrase Boundary Infilling with Replacement), whose pre-training uses a multi-task objective combining Masked Language Modeling, Keyphrase Boundary Infilling, and Keyphrase Replacement Classification, and it is fine-tuned on the OpenKP dataset to identify the most relevant keyphrases in a text.
Implementation Details
The model operates as a token classifier, labeling each word as the beginning of a keyphrase (B-KEY), inside a keyphrase (I-KEY), or outside a keyphrase (O). On the OpenKP test set it reports P@5: 0.13, R@5: 0.38, and F1@M: 0.39.
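To make the reported metrics concrete, here is a minimal sketch of how precision, recall, and F1 at a cutoff are commonly computed for a single document. It assumes exact matching after lowercasing and whitespace normalization, divides precision by the number of predictions actually kept, and treats `k=None` as the "@M" setting (all predictions the model makes). The official evaluation script may differ in these details, and the function name `score_at_k` and the sample phrases are hypothetical.

```python
def score_at_k(predicted, gold, k=None):
    """Return (precision, recall, f1) for one document.

    predicted: keyphrases ranked from most to least confident.
    gold: reference keyphrases.
    k: cutoff; None keeps every prediction (the "@M" setting).
    """
    def normalize(phrase):
        return " ".join(phrase.lower().split())

    # Deduplicate predictions while preserving rank order.
    ranked = []
    for phrase in predicted:
        phrase = normalize(phrase)
        if phrase not in ranked:
            ranked.append(phrase)

    top = ranked if k is None else ranked[:k]
    gold_set = {normalize(g) for g in gold}
    correct = sum(1 for phrase in top if phrase in gold_set)

    precision = correct / len(top) if top else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


predicted = ["keyphrase extraction", "transformer", "web documents",
             "dataset", "model"]
gold = ["keyphrase extraction", "web documents"]
print(score_at_k(predicted, gold, k=5))
```

In this example both gold keyphrases are found, so recall is 1.0, yet precision at 5 is only 0.4 because three of the five predictions have no gold counterpart. When documents carry only a few reference keyphrases, precision at a fixed cutoff is bounded well below 1 in the same way.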
- Multi-task pre-training objective combining MLM, KBI, and KRC
- Trained on 148,124 real-world web documents
- Implements a token classification pipeline over B-KEY, I-KEY, and O labels
- Supports batch processing and efficient keyphrase extraction
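The B-KEY / I-KEY / O labeling described in this section still has to be turned into keyphrase strings. The sketch below shows one way to do that decoding, assuming the labels are already aligned one-to-one with words. The function name `decode_bio` and the sample sentence are hypothetical, and the handling of a stray I-KEY (ignored when no keyphrase is open) is a design choice rather than documented behavior of the model.

```python
def decode_bio(words, labels):
    """Group words into keyphrases from per-word B-KEY / I-KEY / O labels."""
    keyphrases = []
    current = []  # words of the keyphrase currently being built

    for word, label in zip(words, labels):
        if label == "B-KEY":
            # A new keyphrase starts; close any open one first.
            if current:
                keyphrases.append(" ".join(current))
            current = [word]
        elif label == "I-KEY" and current:
            # Continue the open keyphrase.
            current.append(word)
        else:
            # "O", or an I-KEY with no open keyphrase: close and reset.
            if current:
                keyphrases.append(" ".join(current))
            current = []

    if current:
        keyphrases.append(" ".join(current))
    return keyphrases


words = "Keyphrase extraction finds important phrases in web documents".split()
labels = ["B-KEY", "I-KEY", "O", "O", "O", "O", "B-KEY", "I-KEY"]
print(decode_bio(words, labels))  # ['Keyphrase extraction', 'web documents']
```

In practice a transformer tokenizer splits words into subword pieces, so a real pipeline also has to merge piece-level predictions back into words before this step.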
Core Capabilities
- Automatic extraction of relevant keyphrases from English text
- Semantic understanding of document context
- Token-level keyphrase boundary detection
- Efficient processing of long documents
- Support for both single and multi-word keyphrases
Frequently Asked Questions
Q: What makes this model unique?
What sets the model apart is its KBIR backbone and multi-task pre-training, which let it capture semantic dependencies and context better than traditional statistical methods. It is particularly effective at modeling the contextual relationships between words and phrases in a document.
Q: What are the recommended use cases?
The model is well suited to automatic document indexing, content summarization, and automated metadata generation. It is particularly useful for processing large volumes of text where manual keyphrase extraction would be time-consuming. However, it is limited to English-language documents and has a cap on the number of predicted keyphrases.
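For the indexing and metadata use cases above, a minimal usage sketch with the Hugging Face `transformers` token classification pipeline might look like the following. The Hub identifier `ml6team/keyphrase-extraction-kbir-openkp` is an assumption (this page gives only the model name), and the helper names `unique_keyphrases` and `extract_keyphrases` are hypothetical. The `aggregation_strategy="simple"` option asks the pipeline to merge adjacent tagged tokens into spans.

```python
def unique_keyphrases(entities, max_keyphrases=None):
    """Deduplicate pipeline output (case-insensitively), keeping first-seen order.

    entities: list of dicts with a "word" key, as returned by an
    aggregated token-classification pipeline.
    """
    seen = set()
    result = []
    for entity in entities:
        phrase = entity["word"].strip()
        key = phrase.lower()
        if phrase and key not in seen:
            seen.add(key)
            result.append(phrase)
    return result if max_keyphrases is None else result[:max_keyphrases]


def extract_keyphrases(text,
                       model_name="ml6team/keyphrase-extraction-kbir-openkp"):
    """Tag `text` and return its keyphrases. model_name is an assumed Hub ID."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",  # merge adjacent tagged tokens into spans
    )
    return unique_keyphrases(tagger(text))


# Stand-in for pipeline output, so the helper can be tried offline.
sample = [{"word": " Keyphrase extraction"}, {"word": "keyphrase extraction"},
          {"word": "web documents"}]
print(unique_keyphrases(sample))  # ['Keyphrase extraction', 'web documents']
```

Calling `extract_keyphrases("some English text")` downloads the model on first use. For large document collections it is cheaper to build the pipeline once and reuse it rather than constructing it per call as this sketch does.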