keyphrase-extraction-kbir-inspec

Property	Value
License	MIT
Paper	Research Paper
Dataset	Inspec
F1 Score	0.588

What is keyphrase-extraction-kbir-inspec?

This is a transformer-based model for extracting keyphrases from scientific text. It builds on KBIR (Keyphrase Boundary Infilling with Replacement) and is fine-tuned on the Inspec dataset to identify the central concepts in academic papers and technical documents. The model treats keyphrase extraction as a token classification problem: each word is labeled as the beginning of a keyphrase (B-KEY), inside a keyphrase (I-KEY), or outside any keyphrase (O).
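To make the labeling scheme concrete, here is a minimal sketch of how a sequence of B-KEY / I-KEY / O labels could be turned back into keyphrases. The function name `decode_bio` and the sample sentence are illustrative assumptions, not part of the model's own code, and a stray I-KEY with no preceding B-KEY is simply treated as O here.

```python
def decode_bio(tokens, labels):
    """Group tokens tagged B-KEY / I-KEY into keyphrase strings."""
    phrases, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-KEY":
            # A new keyphrase starts; close the previous one if still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif label == "I-KEY" and current:
            # Continue the keyphrase that is currently open.
            current.append(token)
        else:
            # "O" (or an I-KEY with no open phrase): close any open phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-written example labels, not actual model output.
tokens = ["Keyphrase", "extraction", "is", "a", "technique", "in", "text", "analysis"]
labels = ["B-KEY", "I-KEY", "O", "O", "O", "O", "B-KEY", "I-KEY"]
print(decode_bio(tokens, labels))  # ['Keyphrase extraction', 'text analysis']
```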

Implementation Details

The KBIR approach uses a multi-task learning setup that combines Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC). This model was trained with a learning rate of 1e-4 for up to 50 epochs, with an early stopping patience of 3.
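The early stopping setting can be illustrated with a small sketch. It assumes the monitored quantity is a validation loss where lower is better, which this document does not state. The function name `run_early_stopping` and the loss values are made up for illustration.

```python
def run_early_stopping(val_losses, max_epochs=50, patience=3):
    """Return (epochs_run, best_epoch) for a list of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs, or after `max_epochs`, whichever comes first.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_epoch


# Illustrative losses only: the best value is at epoch 3, so with a
# patience of 3 the loop stops after epoch 6.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
print(run_early_stopping(losses))  # (6, 3)
```

With a patience of 3, the 50-epoch figure acts as an upper bound rather than the number of epochs actually run.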

Built on the KBIR architecture
Processes text through a token classification pipeline
Supports batch processing with a maximum sequence length of 512 tokens
Applies post-processing to aggregate token-level predictions into keyphrases
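The aggregation step listed above might look roughly like the sketch below. It assumes the Hugging Face `transformers` token-classification pipeline with `aggregation_strategy="simple"`, which returns a list of dicts with a `word` field. The helper names are hypothetical, and `model_name` is left as a parameter because the full Hub identifier is not given here.

```python
def aggregate_keyphrases(entities):
    """Collapse pipeline-style entity dicts into a sorted list of unique keyphrases."""
    cleaned = {entity["word"].strip() for entity in entities}
    return sorted(phrase for phrase in cleaned if phrase)


def extract_keyphrases(text, model_name):
    """Run a token-classification pipeline and aggregate its output.

    Assumes `transformers` is installed and `model_name` points to this
    model on the Hub or on disk. Imported lazily so the helper above can
    be used without the library.
    """
    from transformers import pipeline

    extractor = pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",  # merge sub-word tokens into spans
    )
    return aggregate_keyphrases(extractor(text))


# Hand-written sample in the shape the pipeline returns, not real model output.
sample = [
    {"entity_group": "KEY", "word": " text analysis", "score": 0.99},
    {"entity_group": "KEY", "word": "keyphrase extraction", "score": 0.98},
    {"entity_group": "KEY", "word": "text analysis ", "score": 0.97},
]
print(aggregate_keyphrases(sample))  # ['keyphrase extraction', 'text analysis']
```

Deduplicating after stripping whitespace keeps a keyphrase that occurs several times in a document from being reported more than once.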

Core Capabilities

Automatic extraction of relevant keyphrases from scientific documents
F1@M score of 0.56 on the Inspec test set
Effective handling of long-term semantic dependencies
Support for English-language documents
Specialized performance on scientific and technical content
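For readers unfamiliar with the metric, the sketch below shows one way an F1@M-style score could be computed for a single document, taking "M" to mean that all predicted keyphrases are scored rather than a top-k cutoff. It assumes exact matching after lowercasing and omits stemming, so it will not necessarily reproduce the reported figure. The function name and the example phrases are illustrative.

```python
def f1_at_m(predicted, gold):
    """Precision, recall and F1 over all predicted keyphrases (no top-k cutoff)."""
    pred = {p.lower().strip() for p in predicted}
    ref = {g.lower().strip() for g in gold}
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    correct = len(pred & ref)
    precision = correct / len(pred)
    recall = correct / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Illustrative phrases only: 2 of 3 predictions match 2 of 4 gold keyphrases.
predicted = ["keyphrase extraction", "text analysis", "neural network"]
gold = ["Keyphrase extraction", "text analysis", "deep learning", "transformers"]
p, r, f1 = f1_at_m(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```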

Frequently Asked Questions

Q: What makes this model unique?

What sets this model apart is the KBIR architecture, which combines multiple learning objectives to better capture keyphrase boundaries and context. On the Inspec test set it reaches an F1@M score of 0.56.

Q: What are the recommended use cases?

The model is specifically designed for processing scientific papers and technical documents, particularly in the domains of Computers, Control, and Information Technology. It's most effective when analyzing academic abstracts and technical content, though it may not perform as well on general-domain texts.