# keyphrase-extraction-distilbert-kptimes
| Property | Value |
|---|---|
| Author | ml6team |
| Training Dataset | KPTimes (279,923 news articles) |
| Learning Rate | 1e-4 |
| Training Epochs | 50 |
| Early Stopping Patience | 3 |
## What is keyphrase-extraction-distilbert-kptimes?
This is a transformer-based model for automatic keyphrase extraction from text documents. Built on DistilBERT and fine-tuned on the KPTimes dataset, it frames keyphrase extraction as a token classification problem: each token is labeled as the beginning of a keyphrase (B-KEY), inside a keyphrase (I-KEY), or outside any keyphrase (O).
## Implementation Details
The model uses DistilBERT as its foundation and implements a token classification pipeline for keyphrase extraction. Its transformer encoder captures long-range semantic dependencies and contextual relationships between words that traditional statistical keyphrase methods miss.
- Token Classification Labels: B-KEY (Beginning of keyphrase), I-KEY (Inside keyphrase), O (Outside keyphrase)
- Maximum sequence length: 512 tokens
- Specialized preprocessing pipeline for proper token alignment
- Post-processing functionality for keyphrase aggregation
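The aggregation step can be illustrated with a minimal sketch of how B-KEY/I-KEY/O labels are merged back into whole keyphrases. The function name and the hard-coded token/label lists below are illustrative assumptions, not the model's actual post-processing code:

```python
def aggregate_keyphrases(tokens, labels):
    """Merge B-KEY/I-KEY token labels into whole keyphrases.

    A keyphrase starts at a B-KEY token and extends over any
    following I-KEY tokens; O tokens lie outside all keyphrases.
    """
    phrases, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-KEY":
            if current:  # close the previous keyphrase
                phrases.append(" ".join(current))
            current = [token]
        elif label == "I-KEY" and current:
            current.append(token)
        else:  # "O" (or a stray I-KEY) ends any open keyphrase
            if current:
                phrases.append(" ".join(current))
                current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


tokens = ["The", "European", "Central", "Bank", "raised", "interest", "rates"]
labels = ["O", "B-KEY", "I-KEY", "I-KEY", "O", "B-KEY", "I-KEY"]
print(aggregate_keyphrases(tokens, labels))
# → ['European Central Bank', 'interest rates']
```

In practice this merging is handled for you when the model is loaded through a token classification pipeline with an aggregation strategy; the sketch only shows what that step does.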
## Core Capabilities
- Automatic extraction of relevant keyphrases from news articles
- High performance on NY Times-style content
- Efficient processing of large text documents
- Semantic understanding of content context
- Batch processing capabilities
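Because input is capped at 512 tokens, documents longer than that are typically split into overlapping windows before being fed to the model in a batch. A minimal sketch, using whitespace tokens as a stand-in for the model's subword tokenizer (the window size and overlap values are illustrative assumptions):

```python
def chunk_document(text, max_tokens=512, overlap=64):
    """Split text into overlapping windows of at most max_tokens tokens.

    Whitespace tokenization stands in for the real subword tokenizer;
    the overlap reduces keyphrases being cut off at window boundaries.
    """
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks


# A 1000-token document becomes three overlapping 512-token windows.
doc = " ".join(f"w{i}" for i in range(1000))
print([len(c.split()) for c in chunk_document(doc)])
# → [512, 512, 104]
```

Each window can then be passed to the model (e.g. as a batch), and the per-window keyphrases deduplicated afterwards.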
## Frequently Asked Questions
**Q: What makes this model unique?**
This model stands out for its specialized training on news articles from the KPTimes dataset, making it particularly effective for journalistic content analysis. Its deep learning approach allows it to understand semantic context better than traditional statistical methods.
**Q: What are the recommended use cases?**
The model is best suited for keyphrase extraction from news articles, particularly content similar to that of the New York Times. It is not recommended for other domains, although experimenting outside the news domain is possible. The model works only with English-language content.