ASPIRE Sentence Embedder

Property	Value
Author	Allen AI
License	Apache-2.0
Paper	Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity
Training Data	4.3M sentence pairs from scientific literature

What is aspire-sentence-embedder?

The ASPIRE sentence embedder is a SciBERT-based model for scientific text similarity tasks. Developed by Allen AI, it targets academic and scientific literature, with its strongest coverage in biomedical text. It produces a sentence embedding from the CLS token representation and was trained on co-citation context sentences drawn from the Semantic Scholar Open Research Corpus (S2ORC).
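As an illustration of the CLS-token embedding described above, here is a minimal sketch using the transformers library. The Hub identifier "allenai/aspire-sentence-embedder" and the helper name embed_sentences are assumptions made for this example, not details confirmed by this page.

```python
from typing import List


def embed_sentences(sentences: List[str],
                    model_name: str = "allenai/aspire-sentence-embedder",  # assumed Hub id
                    max_length: int = 512) -> "torch.Tensor":
    """Return one CLS-token embedding per input sentence."""
    # Imported lazily so the heavy dependencies load only when embeddings are requested.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Pad to the longest sentence in the batch and truncate at max_length tokens.
    inputs = tokenizer(sentences, padding=True, truncation=True,
                       max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Position 0 of the last hidden layer holds the [CLS] token representation.
    return outputs.last_hidden_state[:, 0, :]
```

Calling embed_sentences(["first sentence", "second sentence"]) should return a tensor with one row per sentence. The first call downloads the model weights, so it needs network access.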

Implementation Details

The model is trained in a contrastive learning setup on pairs of co-citation context sentences. Training uses the Adam optimizer with a learning rate of 2e-5, 1000 warm-up steps, and linear decay afterwards. The model builds on the SciBERT backbone and can be loaded with either the transformers or sentence_transformers library.
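The warm-up and decay schedule can be sketched as a plain function of the step count. This is an illustrative sketch: the total number of training steps is not stated here, so total_steps is a hypothetical parameter, and decaying all the way to zero is an assumption. The shape matches what get_linear_schedule_with_warmup in transformers produces, though whether the authors used that helper is not stated.

```python
def learning_rate(step: int,
                  total_steps: int,          # hypothetical: not given in the source
                  peak_lr: float = 2e-5,
                  warmup_steps: int = 1000) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        # Warm-up phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from peak_lr down to 0, never below 0.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

With an assumed total_steps of 10000, the rate climbs to 2e-5 at step 1000 and falls back to half of that at step 5500.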

Trained on 4.3 million sentence pairs from scientific literature
Uses in-batch negative sampling during training
Supports a maximum sequence length of 512 tokens
Intended for both biomedical and computer science text
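To make in-batch negative sampling concrete, the sketch below computes a softmax cross-entropy (InfoNCE-style) loss in which each anchor's paired sentence is the positive and every other sentence in the batch acts as a negative. The source only states that training is contrastive with in-batch negatives, so the exact objective, the dot-product similarity, and the temperature parameter are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(anchors: List[Sequence[float]],
                              positives: List[Sequence[float]],
                              temperature: float = 1.0) -> float:
    """Mean cross-entropy where positives[i] is the match for anchors[i]
    and all other positives in the batch serve as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every candidate in the batch.
        logits = [dot(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the true pair.
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)
```

A batch whose pairs are correctly aligned yields a lower loss than the same batch with its positives shuffled, which is the signal that pulls co-cited sentences together during training.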

Core Capabilities

Sentence-level similarity computation for scientific texts
Document retrieval through sentence-level matching
Fine-grained scientific document comparison
Adaptable for classification tasks through fine-tuning
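Document retrieval through sentence-level matching can be sketched over precomputed sentence embeddings. In this hedged example each candidate document is scored by its single best-matching sentence pair under cosine similarity. Both the max aggregation and the cosine measure are assumptions; another distance, such as L2, could be swapped in. The names cosine and rank_documents are hypothetical helpers.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def rank_documents(query_sentences: List[Sequence[float]],
                   candidates: Dict[str, List[Sequence[float]]]
                   ) -> List[Tuple[str, float]]:
    """Score each candidate by its best-matching sentence pair, best first."""
    scored = []
    for doc_id, doc_sentences in candidates.items():
        if not doc_sentences:
            continue  # skip documents with no sentence embeddings
        best = max(cosine(q, s)
                   for q in query_sentences
                   for s in doc_sentences)
        scored.append((doc_id, best))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Here query_sentences holds the embeddings of the query document's sentences and candidates maps a document id to the embeddings of its sentences.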

Frequently Asked Questions

Q: What makes this model unique?

What sets the model apart is its training on scientific co-citation contexts, which suits it to academic text similarity tasks. About 50% of its training data comes from biomedical sources, so performance is likely to be strongest on biomedical text, while the remaining data covers other scientific fields.

Q: What are the recommended use cases?

The model is best suited for sentence similarity tasks in scientific text, particularly in the biomedical and computer science domains. It can be used for document retrieval and abstract comparison, and with fine-tuning it can be adapted for classification tasks. It performs well on benchmarks such as RELISH and TRECCOVID (biomedical) and CSFCube (computer science).