vlt5-base-keywords
Property | Value |
---|---|
Parameter Count | 275M |
License | CC BY 4.0 |
Languages | English, Polish |
Paper | Link to Paper |
What is vlt5-base-keywords?
vlt5-base-keywords is a specialized keyword generation model developed by Voicelab, based on the T5 transformer architecture. It's designed to extract and generate keywords from scientific texts, particularly abstracts and titles, using an encoder-decoder architecture. The model was trained on the POSMAC corpus, containing over 216,000 scientific publication abstracts.
Implementation Details
The model uses a SentencePiece unigram tokenizer with a 50k-token vocabulary. The recommended generation parameters are no_repeat_ngram_size=3 and num_beams=4. It handles both English and Polish text and typically generates 3 to 5 keywords for an abstract-length input.
- Architecture based on Google's T5 transformer blocks
- Trained on scientific article corpus (POSMAC)
- Supports both extractive and abstractive keyword generation
- 275M parameters
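The generation settings above can be sketched with the Hugging Face transformers API. This is a minimal sketch, not the authors' reference code: the Hub ID `Voicelab/vlt5-base-keywords`, the `Keywords: ` task prefix, and the comma-separated output format are assumptions not stated in this section, and the helper names are illustrative.

```python
# Generation settings quoted in this section.
GENERATION_KWARGS = {"no_repeat_ngram_size": 3, "num_beams": 4}

# Assumptions (not stated in this section): Hub ID and task prefix.
MODEL_ID = "Voicelab/vlt5-base-keywords"
TASK_PREFIX = "Keywords: "


def load_model(model_id=MODEL_ID):
    """Load tokenizer and model; transformers is imported lazily."""
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_id)
    model = T5ForConditionalGeneration.from_pretrained(model_id)
    return tokenizer, model


def parse_keywords(decoded):
    """Split decoded text into keywords, assuming comma-separated output."""
    return [kw.strip() for kw in decoded.split(",") if kw.strip()]


def generate_keywords(text, tokenizer, model):
    """Generate keywords for a single abstract-length text."""
    input_ids = tokenizer(
        [TASK_PREFIX + text], return_tensors="pt", truncation=True
    ).input_ids
    output = model.generate(input_ids, **GENERATION_KWARGS)
    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    return parse_keywords(decoded)
```

Calling `tokenizer, model = load_model()` and then `generate_keywords(abstract, tokenizer, model)` would return a list of keyword strings; the first call downloads the model weights.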
Core Capabilities
- Precise keyword extraction from academic texts
- Multi-language support (primarily English and Polish)
- High precision in generating relevant keyphrases
- Effective on various text domains
- Achieves state-of-the-art results with 0.357 precision at rank 5
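For reference, precision at rank 5 (P@5) is the fraction of the top five predicted keywords that also appear in the reference set. A minimal sketch follows, assuming case-insensitive exact matching and division by k; the matching scheme behind the 0.357 figure is not stated here, so this may differ from the reported evaluation. The function name is illustrative.

```python
def precision_at_k(predicted, reference, k=5):
    """Fraction of the top-k predicted keywords found in the reference set."""
    if k <= 0:
        raise ValueError("k must be positive")
    gold = {kw.strip().lower() for kw in reference}
    # Drop duplicate predictions (keeping order) so a repeat is not counted twice.
    unique = list(dict.fromkeys(kw.strip().lower() for kw in predicted))
    hits = sum(1 for kw in unique[:k] if kw in gold)
    return hits / k


predicted = ["T5", "keyword extraction", "Polish", "transformers", "NLP"]
reference = ["t5", "nlp", "keyphrase generation"]
score = precision_at_k(predicted, reference)  # 2 of the top 5 match
```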
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to work across different domains and text types while maintaining high precision in keyword generation makes it stand out. It can generate both extractive and abstractive keywords, showing flexibility in its application.
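The difference between extractive and abstractive keywords can be checked mechanically: an extractive keyword occurs verbatim in the input, while an abstractive one does not. A minimal sketch, assuming case-insensitive substring matching with whitespace normalized (a stricter check might compare lemmas or word boundaries); the function name is illustrative.

```python
import re


def split_by_origin(text, keywords):
    """Return (extractive, abstractive) lists for the given keywords."""
    normalized = re.sub(r"\s+", " ", text).lower()
    extractive, abstractive = [], []
    for kw in keywords:
        needle = re.sub(r"\s+", " ", kw).strip().lower()
        if needle and needle in normalized:
            extractive.append(kw)
        else:
            abstractive.append(kw)
    return extractive, abstractive
```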
Q: What are the recommended use cases?
The model is best suited to academic and scientific text analysis, particularly extracting keywords from abstracts and research papers. It works best on abstract-length text; longer texts can be split into smaller chunks and processed separately.
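Splitting a long document into abstract-sized pieces could look like the sketch below. It assumes a hypothetical limit of 200 words per chunk and a simple regex sentence splitter; neither comes from the model's documentation. `merge_keywords` deduplicates the per-chunk results while keeping first-seen order.

```python
import re


def chunk_text(text, max_words=200):
    """Group sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def merge_keywords(per_chunk_keywords):
    """Merge keyword lists, dropping case-insensitive duplicates."""
    seen, merged = set(), []
    for keywords in per_chunk_keywords:
        for kw in keywords:
            key = kw.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(kw.strip())
    return merged
```

Each chunk would then be passed to the model separately, and the resulting lists combined with `merge_keywords`.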