ASPIRE Sentence Embedder
| Property | Value |
|---|---|
| Author | Allen AI |
| License | Apache-2.0 |
| Paper | Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity |
| Training Data | 4.3M sentence pairs from scientific literature |
What is aspire-sentence-embedder?
The ASPIRE sentence embedder is a SciBERT-based model for measuring similarity between sentences in scientific text. Developed by Allen AI, it targets academic literature and is strongest on biomedical text, which makes up about half of its training data. The model represents each sentence by its CLS token embedding and was trained on co-citation contexts drawn from the Semantic Scholar Open Research Corpus (S2ORC).
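As a rough illustration of the CLS-token embedding described above, the sketch below loads the model with the transformers library and takes the first-token hidden state as the sentence vector. The Hub ID `allenai/aspire-sentence-embedder` and the helper names `embed_sentences` and `cosine` are assumptions made for this example. Cosine similarity is used only because it is a common choice; the description above does not say which distance the model was trained with.

```python
import math


def embed_sentences(sentences, model_name="allenai/aspire-sentence-embedder"):
    """Return one CLS-token vector (a list of floats) per input sentence.

    model_name is an assumed Hugging Face Hub ID; adjust it if the
    checkpoint is published under a different name.
    """
    # Imported lazily so the pure-Python helper below works without them.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(
        sentences,
        padding=True,
        truncation=True,
        max_length=512,  # assumed BERT-style input limit
        return_tensors="pt",
    )
    with torch.no_grad():
        output = model(**batch)
    # Position 0 of the last hidden state is the [CLS] token.
    return output.last_hidden_state[:, 0, :].tolist()


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Calling `embed_sentences(["first sentence", "second sentence"])` needs `torch`, `transformers`, and network access for the first download; passing the two resulting vectors to `cosine` gives a similarity score between -1 and 1.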
Implementation Details
The model is trained with a contrastive objective on co-citation context sentences. Training used the Adam optimizer with a learning rate of 2e-5 and 1000 warm-up steps, followed by linear decay. The architecture builds on the SciBERT backbone, and the model can be loaded with either the transformers or the sentence_transformers library.
- Trained on 4.3 million sentence pairs from scientific literature
- Utilizes in-batch negative sampling during training
- Supports maximum sequence length of 512 tokens
- Optimized for both biomedical and computer science domains
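The training setup above can be sketched in plain Python. The block shows a warm-up-then-linear-decay schedule with the stated 2e-5 learning rate and 1000 warm-up steps, plus one common form of contrastive loss with in-batch negatives (softmax cross-entropy over dot-product scores). The function names, the `total_steps` argument, the decay-to-zero endpoint, the dot-product similarity, and the `temperature` parameter are all assumptions for illustration; the exact loss and schedule length used in training are not stated here.

```python
import math


def learning_rate(step, total_steps, peak_lr=2e-5, warmup_steps=1000):
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps.

    total_steps is a hypothetical value; the real training length is unknown.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


def in_batch_negative_loss(queries, positives, temperature=1.0):
    """Mean cross-entropy where queries[i] should match positives[i].

    Every other positive in the batch acts as a negative for row i.
    Similarity is a plain dot product (an assumption of this sketch).
    """
    losses = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, p)) / temperature for p in positives
        ]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)
```

With this loss, a batch whose pairs are correctly aligned scores lower than the same batch with its positives shuffled, which is the signal that pulls co-cited sentences together and pushes unrelated ones apart.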
Core Capabilities
- Sentence-level similarity computation for scientific texts
- Document retrieval through sentence-level matching
- Fine-grained scientific document comparison
- Adaptable for classification tasks through fine-tuning
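Document retrieval through sentence-level matching, as listed above, might look like the following sketch. It assumes sentence vectors have already been computed and scores each document by its single best-matching sentence. The function names, the max-over-sentences aggregation, the cosine similarity, and the toy two-dimensional vectors are all assumptions for illustration, not a description of how the model authors rank documents.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def document_score(query_vec, sentence_vecs):
    """Score a document by its best-matching sentence."""
    return max(cosine(query_vec, s) for s in sentence_vecs)


def rank_documents(query_vec, documents):
    """Rank documents for one query sentence vector.

    documents maps a document ID to a list of sentence vectors.
    Returns (doc_id, score) pairs, best match first.
    """
    scored = [
        (doc_id, document_score(query_vec, vecs))
        for doc_id, vecs in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-D vectors standing in for real sentence embeddings.
toy_docs = {
    "paper_a": [[1.0, 0.0], [0.6, 0.8]],
    "paper_b": [[0.0, 1.0], [-1.0, 0.0]],
}
ranking = rank_documents([0.8, 0.6], toy_docs)
```

Taking the maximum means one closely related sentence is enough to surface a document; averaging over sentences would instead favor documents that are related throughout.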
Frequently Asked Questions
Q: What makes this model unique?
The model is trained specifically on scientific co-citation contexts, which suits it to similarity tasks on academic text. About 50% of its training data comes from biomedical sources, so it performs best in that domain while still performing well in other scientific fields.
Q: What are the recommended use cases?
The model is best suited to sentence similarity tasks on scientific text, particularly in the biomedical and computer science domains. It can be used for document retrieval and abstract comparison and, with fine-tuning, adapted for classification tasks. It performs well on datasets such as RELISH, TRECCOVID, and CSFCube.