# keyphrase-extraction-distilbert-kptimes
| Property | Value |
|---|---|
| Author | ml6team |
| Training Dataset | KPTimes (279,923 news articles) |
| Learning Rate | 1e-4 |
| Training Epochs | 50 |
| Early Stopping Patience | 3 |
## What is keyphrase-extraction-distilbert-kptimes?
This is a transformer-based model for automatic keyphrase extraction from text documents. Built on DistilBERT and fine-tuned on the KPTimes dataset, it frames keyphrase extraction as a token classification problem: each token is labeled as the beginning of a keyphrase (B-KEY), inside a keyphrase (I-KEY), or outside any keyphrase (O).
## Implementation Details
The model uses DistilBERT as its foundation and implements a token classification pipeline for keyphrase extraction. Its transformer encoder captures long-range semantic dependencies and contextual relationships between words that traditional statistical keyphrase methods miss.
- Token Classification Labels: B-KEY (Beginning of keyphrase), I-KEY (Inside keyphrase), O (Outside keyphrase)
- Maximum sequence length: 512 tokens
- Specialized preprocessing pipeline for proper token alignment
- Post-processing functionality for keyphrase aggregation
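The aggregation step can be illustrated with a minimal sketch of how B-KEY/I-KEY/O labels are merged back into whole keyphrases. The function name and the hard-coded token/label lists below are illustrative assumptions, not the model's actual post-processing code:

```python
def aggregate_keyphrases(tokens, labels):
    """Merge B-KEY/I-KEY token labels into whole keyphrases.

    A keyphrase starts at a B-KEY token and extends over any
    following I-KEY tokens; O tokens lie outside all keyphrases.
    """
    phrases, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-KEY":
            if current:  # close the previous keyphrase
                phrases.append(" ".join(current))
            current = [token]
        elif label == "I-KEY" and current:
            current.append(token)
        else:  # "O" (or a stray I-KEY) ends any open keyphrase
            if current:
                phrases.append(" ".join(current))
                current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


tokens = ["The", "European", "Central", "Bank", "raised", "interest", "rates"]
labels = ["O", "B-KEY", "I-KEY", "I-KEY", "O", "B-KEY", "I-KEY"]
print(aggregate_keyphrases(tokens, labels))
# → ['European Central Bank', 'interest rates']
```

In practice this merging is handled for you when the model is loaded through a token classification pipeline with an aggregation strategy; the sketch only shows what that step does.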
## Core Capabilities
- Automatic extraction of relevant keyphrases from news articles
- High performance on NY Times-style content
- Efficient processing of large text documents
- Semantic understanding of content context
- Batch processing capabilities
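Because input is capped at 512 tokens, documents longer than that are typically split into overlapping windows before being fed to the model in a batch. A minimal sketch, using whitespace tokens as a stand-in for the model's subword tokenizer (the window size and overlap values are illustrative assumptions):

```python
def chunk_document(text, max_tokens=512, overlap=64):
    """Split text into overlapping windows of at most max_tokens tokens.

    Whitespace tokenization stands in for the real subword tokenizer;
    the overlap reduces keyphrases being cut off at window boundaries.
    """
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks


# A 1000-token document becomes three overlapping 512-token windows.
doc = " ".join(f"w{i}" for i in range(1000))
print([len(c.split()) for c in chunk_document(doc)])
# → [512, 512, 104]
```

Each window can then be passed to the model (e.g. as a batch), and the per-window keyphrases deduplicated afterwards.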
## Frequently Asked Questions
**Q: What makes this model unique?**
This model stands out for its specialized training on news articles from the KPTimes dataset, making it particularly effective for journalistic content analysis. Its deep learning approach allows it to understand semantic context better than traditional statistical methods.
**Q: What are the recommended use cases?**
The model is best suited for keyphrase extraction from news articles, particularly content similar to that of the New York Times. It is not recommended for other domains, although experimenting outside the news domain is possible. The model works only with English-language content.