vlt5-base-keywords

Property	Value
Parameter Count	275M
License	CC BY 4.0
Languages	English, Polish
Paper	Link to Paper

What is vlt5-base-keywords?

vlt5-base-keywords is a specialized keyword generation model developed by Voicelab, based on the T5 transformer architecture. It's designed to extract and generate keywords from scientific texts, particularly abstracts and titles, using an encoder-decoder architecture. The model was trained on the POSMAC corpus, containing over 216,000 scientific publication abstracts.

Implementation Details

The model uses a SentencePiece unigram tokenizer with a 50k-token vocabulary. The recommended generation settings are beam search with num_beams=4 and no_repeat_ngram_size=3. It handles both English and Polish text and typically generates 3-5 keywords for an abstract-length input.
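The generation settings above can be sketched with the Hugging Face transformers library. This is a minimal sketch, not a reference implementation: the Hub model id "Voicelab/vlt5-base-keywords" and the "Keywords: " task prefix are assumptions that do not appear in this document, and build_prompt and generate_keywords are hypothetical helper names.

```python
# Generation settings named in the text above.
GENERATION_KWARGS = {"no_repeat_ngram_size": 3, "num_beams": 4}

# Assumptions (not stated in this document): the Hub model id and the task prefix.
MODEL_ID = "Voicelab/vlt5-base-keywords"
TASK_PREFIX = "Keywords: "


def build_prompt(text: str) -> str:
    """Prepend the assumed task prefix and collapse runs of whitespace."""
    return TASK_PREFIX + " ".join(text.split())


def generate_keywords(text: str) -> str:
    """Load the model and return the decoded keyword string for one text."""
    # Imported lazily so the helpers above work without the heavy dependency.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(MODEL_ID)
    model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)

    # Truncate inputs that exceed the tokenizer's maximum length.
    input_ids = tokenizer(
        build_prompt(text), return_tensors="pt", truncation=True
    ).input_ids

    # Beam search with 4 beams, forbidding any repeated 3-gram in the output.
    output_ids = model.generate(input_ids, **GENERATION_KWARGS)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling generate_keywords(abstract) downloads the weights on first use and reloads the model on every call, so batch use would load the model once and reuse it.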

Architecture based on Google's T5 transformer blocks
Trained on scientific article corpus (POSMAC)
Supports both extractive and abstractive keyword generation
275M parameters

Core Capabilities

Precise keyword extraction from academic texts
Multi-language support (primarily English and Polish)
High precision in generating relevant keyphrases
Effective on various text domains
Reported precision of 0.357 at rank 5 (P@5), described as state-of-the-art
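For readers unfamiliar with the metric, precision at rank 5 can be sketched as follows. This is an illustration under assumptions: the document does not describe the evaluation protocol, so the lowercase exact-match comparison, the de-duplication of predictions, and the choice to divide by k are all assumptions, and precision_at_k is a hypothetical helper name.

```python
def precision_at_k(predicted: list[str], gold: list[str], k: int = 5) -> float:
    """Fraction of the top-k predicted keywords that appear in the gold set.

    Assumptions: keywords match after stripping and lowercasing, duplicate
    predictions count once, and the denominator is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    gold_set = {g.strip().lower() for g in gold}

    # Normalize predictions and drop duplicates while keeping rank order.
    ranked: list[str] = []
    for p in predicted:
        norm = p.strip().lower()
        if norm and norm not in ranked:
            ranked.append(norm)

    hits = sum(1 for p in ranked[:k] if p in gold_set)
    return hits / k
```

With five predictions of which two match the gold keywords, the function returns 2 / 5.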

Frequently Asked Questions

Q: What makes this model unique?

The model works across different domains and text types while maintaining high precision in keyword generation. It produces both extractive keywords (phrases that appear in the input text) and abstractive keywords (phrases that do not appear verbatim in the input), which makes it more flexible than a purely extractive approach.
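The extractive/abstractive distinction can be checked on any model output with a few lines of Python. The sketch below assumes the model returns a single comma-separated string of keywords, which this document does not state; split_keywords is a hypothetical helper, and the whole-word, case-insensitive match is one possible definition of "appears in the text".

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and reduce text to space-separated word tokens."""
    return " ".join(re.findall(r"\w+", text.lower()))


def split_keywords(output: str, source_text: str) -> dict[str, list[str]]:
    """Sort keywords into those found in the source text and those not found.

    Assumes `output` is a comma-separated keyword string.
    """
    # Pad with spaces so that only whole-word sequences can match.
    normalized_source = f" {_normalize(source_text)} "
    result: dict[str, list[str]] = {"extractive": [], "abstractive": []}

    for raw in output.split(","):
        keyword = raw.strip()
        if not keyword:
            continue
        norm = _normalize(keyword)
        present = bool(norm) and f" {norm} " in normalized_source
        result["extractive" if present else "abstractive"].append(keyword)

    return result
```

Punctuation is ignored by the normalization, so "encoder-decoder" in the output matches "encoder-decoder" or "encoder decoder" in the text.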

Q: What are the recommended use cases?

The model is best suited to academic and scientific text analysis, particularly extracting keywords from abstracts and research papers. It works best on abstract-length texts; longer texts can be processed by splitting them into smaller chunks and generating keywords for each chunk.
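One way to split a longer text into abstract-length pieces and combine the per-chunk results is sketched below. The 200-word default is an arbitrary assumption (the document gives no chunk size), the sentence splitter is a simple regular expression rather than a proper segmenter, and chunk_text and merge_keywords are hypothetical helper names. The merge step also assumes comma-separated model output.

```python
import re


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly max_words words.

    A single sentence longer than max_words is kept whole in its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    count = 0

    for sentence in sentences:
        n_words = len(sentence.split())
        if n_words == 0:
            continue
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words

    if current:
        chunks.append(" ".join(current))
    return chunks


def merge_keywords(outputs: list[str]) -> list[str]:
    """Merge comma-separated keyword strings, dropping case-insensitive
    duplicates and keeping first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for output in outputs:
        for raw in output.split(","):
            keyword = raw.strip()
            if keyword and keyword.lower() not in seen:
                seen.add(keyword.lower())
                merged.append(keyword)
    return merged
```

Each chunk would be passed to the model separately, and the resulting keyword strings handed to merge_keywords to produce one list for the whole document.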