aspire-contextualsentence-singlem-compsci

Property	Value
Author	Allen AI
Paper	Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity
GitHub	allenai/aspire
Performance (MAP)	41.33 on CSFCube

What is aspire-contextualsentence-singlem-compsci?

This is a BERT-based model for fine-grained similarity matching between computer science papers. It represents each document as a set of contextual sentence vectors, created by averaging the token representations of each sentence while the encoder attends across both the title and the abstract. The model was trained on 1.2 million computer science paper pairs, using co-citation contexts to align sentences.
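The sentence-vector construction described above can be sketched as a simple mean-pooling step. This is a minimal illustration, not the authors' implementation: the function name `sentence_vectors`, the span format, and the toy embeddings are all assumptions, and the token embeddings are assumed to come from one encoder pass over the title and abstract together.

```python
import numpy as np

def sentence_vectors(token_embeddings, sentence_spans):
    """Average the token embeddings inside each sentence span.

    token_embeddings: array of shape (num_tokens, hidden_dim), assumed to be
        the encoder output for the title and abstract encoded together.
    sentence_spans: list of (start, end) token indices, end exclusive
        (a hypothetical format chosen for this sketch).
    """
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    return np.stack(
        [token_embeddings[start:end].mean(axis=0) for start, end in sentence_spans]
    )

# Toy input: 5 tokens with 2-dimensional embeddings, split into two sentences.
tokens = np.array([[1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
spans = [(0, 2), (2, 5)]
vectors = sentence_vectors(tokens, spans)  # one row per sentence
```

Because the pooling happens after the encoder has seen the whole input, each sentence vector can still reflect context from the other sentences.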

Implementation Details

The model was trained with the Adam optimizer at a learning rate of 2e-5, with 1000 warm-up steps followed by linear decay. It takes a paper's title and abstract and produces sentence-level embeddings, so documents can be compared at a fine grain by computing L2 distances between sentence vectors.
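The learning-rate schedule can be sketched in plain Python. This assumes the warm-up is linear from zero and that the decay reaches zero at the final step. Neither detail is stated here, and `TOTAL_STEPS` is a hypothetical value, since the training length is not given.

```python
def learning_rate(step, total_steps, peak_lr=2e-5, warmup_steps=1000):
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps.

    Assumes warm-up starts from zero and decay ends at zero; the source
    only states the peak rate, the warm-up length, and "linear decay".
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)

TOTAL_STEPS = 10_000  # hypothetical; not taken from the source
schedule = [learning_rate(s, TOTAL_STEPS) for s in (0, 500, 1000, 5500, 10_000)]
```

With these assumed values, the rate rises to 2e-5 at step 1000 and falls back to half of that by step 5500.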

Trained on co-cited paper pairs with sentence alignment
Uses contrastive learning with in-batch negatives
Implements cross-attention in the encoder block
Scores document pairs at evaluation time by the minimum L2 distance between their sentence vectors
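The minimum-distance scoring in the last item can be illustrated with NumPy. This is a sketch under stated assumptions: the function name `min_sentence_distance` and the toy vectors are hypothetical, and each document is assumed to be already encoded as a matrix with one row per sentence.

```python
import numpy as np

def min_sentence_distance(query_vecs, cand_vecs):
    """Return the smallest L2 distance between any query sentence vector and
    any candidate sentence vector, with the indices of that closest pair."""
    query_vecs = np.asarray(query_vecs, dtype=float)
    cand_vecs = np.asarray(cand_vecs, dtype=float)
    # Pairwise differences via broadcasting: (num_query, num_cand, hidden_dim).
    diffs = query_vecs[:, None, :] - cand_vecs[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    q_idx, c_idx = np.unravel_index(np.argmin(dists), dists.shape)
    return float(dists[q_idx, c_idx]), int(q_idx), int(c_idx)

# Toy sentence vectors for a query paper (2 sentences) and a candidate (3 sentences).
query = np.array([[0.0, 0.0], [1.0, 1.0]])
candidate = np.array([[4.0, 4.0], [1.0, 2.0], [5.0, 0.0]])
distance, q_idx, c_idx = min_sentence_distance(query, candidate)
```

A lower distance means a closer match, and the returned indices show which sentence pair produced the score.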

Core Capabilities

Fine-grained document similarity analysis
Aspect-conditional document retrieval
Sentence-to-sentence similarity matching
Specialization in the computer science domain
Document classification (with fine-tuning)

Frequently Asked Questions

Q: What makes this model unique?

The model represents each document with multiple vectors, one per sentence, and is trained using co-citation contexts. This lets it compare documents at the sentence level rather than only at the document level.

Q: What are the recommended use cases?

The model is best suited to computer science paper similarity tasks, particularly when specific aspects or sentences need to be matched. It fits scenarios where users want to find papers based on specific sentences or concepts rather than on whole-document similarity.
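One way to picture sentence-specific retrieval is to rank candidates using only the query sentences a user selects. The sketch below is illustrative, not the project's API: `rank_by_selected_sentences`, the paper IDs, and the toy vectors are all hypothetical, and the sentence vectors are assumed to be precomputed.

```python
import numpy as np

def rank_by_selected_sentences(query_vecs, selected, candidates):
    """Rank candidate papers by the minimum L2 distance between the selected
    query sentence vectors and each candidate's sentence vectors.

    selected: indices of the query sentences the user cares about.
    candidates: dict mapping a paper ID to its list of sentence vectors.
    Returns (paper_id, distance) pairs, closest first.
    """
    q = np.asarray(query_vecs, dtype=float)[list(selected)]
    scores = {}
    for paper_id, vecs in candidates.items():
        c = np.asarray(vecs, dtype=float)
        dists = np.linalg.norm(q[:, None, :] - c[None, :, :], axis=-1)
        scores[paper_id] = float(dists.min())
    return sorted(scores.items(), key=lambda item: item[1])

# Toy data: a query with two sentences and two candidate papers.
query_vecs = [[0.0, 0.0], [10.0, 10.0]]
candidates = {
    "paper_a": [[9.0, 10.0], [3.0, 4.0]],
    "paper_b": [[0.0, 1.0], [7.0, 7.0]],
}
ranking_first = rank_by_selected_sentences(query_vecs, [0], candidates)
ranking_second = rank_by_selected_sentences(query_vecs, [1], candidates)
```

In this toy data, selecting a different query sentence reverses the ranking, which is the behavior that distinguishes aspect-level retrieval from whole-document matching.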