keyphrase-extraction-kbir-openkp

Property	Value
Model Type	Token Classification
Training Dataset	OpenKP (148,124 web documents)
Learning Rate	1e-4
Training Epochs	50
Early Stopping	3 epochs patience

What is keyphrase-extraction-kbir-openkp?

keyphrase-extraction-kbir-openkp is a transformer model for automatic keyphrase extraction from text documents. It is built on KBIR (Keyphrase Boundary Infilling with Replacement), which is pre-trained with a multi-task objective combining Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC). The model is then fine-tuned on the OpenKP dataset to identify the most relevant keyphrases in a text.

Implementation Details

The model operates as a token classifier, labeling each word as the beginning of a keyphrase (B-KEY), inside a keyphrase (I-KEY), or outside any keyphrase (O). On the OpenKP test set it reports P@5 of 0.13, R@5 of 0.38, and F1@M of 0.39.
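To make the reported metrics concrete, here is a minimal sketch of how precision, recall, and F1 at a cutoff k can be computed for one document. The function name, the exact-match rule after lowercasing, and dividing precision by k are assumptions made for illustration; the evaluation behind the numbers above may differ (for example, by stemming phrases first). "@M" is commonly read as scoring all predictions the model makes, which this sketch approximates with k set to the number of predictions.

```python
def precision_recall_f1_at_k(predicted, gold, k):
    """Score the top-k predicted keyphrases of one document against its gold keyphrases.

    Assumptions (not taken from the model card): `predicted` is ranked and
    already de-duplicated, matching is exact after lowercasing, and
    precision is divided by k.
    """
    top_k = [p.lower() for p in predicted[:k]]
    gold_set = {g.lower() for g in gold}
    hits = sum(1 for p in top_k if p in gold_set)

    precision = hits / k if k else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    # F1 is the harmonic mean; define it as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

With five predictions that contain both of a document's two gold keyphrases, this yields a precision of 0.4 and a recall of 1.0. Under these assumptions, P@5 stays low whenever a document has only a few gold keyphrases, even if the ranking is perfect, which is one reason P@5 can sit well below R@5.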

Multi-task learning architecture combining MLM, KBI, and KRC
Trained on 148,124 real-world web documents
Token classification pipeline using B-KEY, I-KEY, and O labels
Supports batch processing of documents
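The step from per-word labels to keyphrases can be sketched as follows. This is an illustrative decoder, not the model's own post-processing: it assumes word-level tokens with one tag each (the real model works on subword tokens), and the function name `decode_bio` is hypothetical.

```python
def decode_bio(tokens, tags):
    """Group word-level tokens into keyphrases using B-KEY / I-KEY / O tags."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-KEY":
            # A new keyphrase starts; flush any phrase still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I-KEY" and current:
            # Continue the phrase that is currently open.
            current.append(token)
        else:
            # "O" closes any open phrase. An I-KEY with no open phrase
            # is ignored, which is a design choice of this sketch.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


tokens = ["Keyphrase", "extraction", "is", "used", "in", "search", "engines"]
tags = ["B-KEY", "I-KEY", "O", "O", "O", "B-KEY", "O"]
print(decode_bio(tokens, tags))  # ['Keyphrase extraction', 'search']
```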

Core Capabilities

Automatic extraction of relevant keyphrases from English text
Semantic understanding of document context
Keyphrase boundary detection through B-KEY and I-KEY labels
Efficient processing of long documents
Support for both single and multi-word keyphrases
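A hedged sketch of how these capabilities might be used through the Hugging Face `transformers` library follows. The hub identifier `ml6team/keyphrase-extraction-kbir-openkp` is an assumption (this page gives only the model name without a namespace), and the helper names are hypothetical. The `"simple"` aggregation strategy merges adjacent B-KEY and I-KEY tokens into single spans, which is how multi-word keyphrases come back as one entry.

```python
def load_extractor(model_id="ml6team/keyphrase-extraction-kbir-openkp"):
    """Build a token-classification pipeline; downloads the model on first use.

    The default model_id is an assumed hub identifier, not confirmed here.
    """
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def unique_keyphrases(entities):
    """De-duplicate aggregated pipeline output, keeping first-seen order."""
    seen, phrases = set(), []
    for entity in entities:
        phrase = entity["word"].strip()  # tokenizers may leave a leading space
        key = phrase.lower()
        if phrase and key not in seen:
            seen.add(key)
            phrases.append(phrase)
    return phrases


def extract_batch(extractor, texts):
    """Run the extractor over several documents; returns one keyphrase list per text."""
    return [unique_keyphrases(entities) for entities in extractor(list(texts))]
```

Calling `extract_batch(load_extractor(), ["first document ...", "second document ..."])` would return one list of keyphrases per input. Passing the extractor in as an argument keeps the post-processing testable without downloading the model.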

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its KBIR architecture and multi-task learning approach, which let it capture semantic dependencies and the contextual relationships between words and phrases in a document better than traditional statistical methods.

Q: What are the recommended use cases?

The model is suited to automatic document indexing, content summarization, and automated metadata generation. It is particularly useful for processing large volumes of text where manual keyphrase extraction would be time-consuming. However, it is limited to English-language documents and has a cap on the number of predicted keyphrases.