aspire-biencoder-compsci-spec

Maintained By
allenai


  • Author: Allen AI
  • Paper: Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity
  • GitHub: Project Repository
  • Domain: Computer Science / Scientific Documents

What is aspire-biencoder-compsci-spec?

The aspire-biencoder-compsci-spec is a BERT-based bi-encoder designed for measuring similarity between scientific documents. Initialized from the SPECTER model, it represents each document as a single vector: the title and abstract are encoded together, and the [CLS] token from every encoder layer is combined through a learned scalar mix.
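
The scalar mix described above can be sketched as a learned, softmax-weighted sum of the [CLS] vector from each encoder layer. The module below is a minimal PyTorch illustration under stated assumptions: the layer count and hidden size match BERT-base, and the `ScalarMix` module and random inputs are illustrative stand-ins, not the released checkpoint's actual parameters.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Softmax-weighted combination of per-layer [CLS] vectors (illustrative)."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar per encoder layer, plus a global scale.
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_cls: torch.Tensor) -> torch.Tensor:
        # layer_cls: (num_layers, batch, hidden) stack of [CLS] vectors.
        norm_w = torch.softmax(self.weights, dim=0)
        mixed = (norm_w[:, None, None] * layer_cls).sum(dim=0)
        return self.gamma * mixed

# Illustrative shapes for a BERT-base encoder: 12 layers, hidden size 768.
mix = ScalarMix(num_layers=12)
cls_per_layer = torch.randn(12, 4, 768)  # [CLS] vectors for 4 documents
doc_vectors = mix(cls_per_layer)
print(doc_vectors.shape)  # torch.Size([4, 768])
```

In practice the per-layer [CLS] stack would come from the encoder's hidden states rather than random tensors; the single mixed vector is what serves as the document representation.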

Implementation Details

The model was trained on a dataset of 1.2 million computer science paper pairs using a contrastive learning approach. Training used the Adam optimizer with a 2e-5 learning rate, 1000 warm-up steps, and linear decay. The model leverages co-citation patterns in scientific literature: papers cited together in the same context are used as training pairs.

  • Training utilizes in-batch negative sampling for contrastive learning
  • Implements scalar mix parameters for enhanced performance
  • Focuses on title-abstract pair representations
  • Achieves superior performance compared to SPECTER baseline (37.17 vs 34.23 MAP on CSFCube)
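
The in-batch negative scheme above can be illustrated with a small contrastive loss: each query's co-cited positive is the target, and the other positives in the batch act as negatives. This is a generic sketch of that training signal, not the released training code; the function name and random embeddings are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_vecs, pos_vecs, temperature=1.0):
    """Cross-entropy over a batch similarity matrix.

    query_vecs, pos_vecs: (batch, hidden). Row i of pos_vecs is the
    co-cited positive for query i; all other rows serve as negatives.
    """
    sims = query_vecs @ pos_vecs.T / temperature   # (batch, batch)
    targets = torch.arange(sims.size(0))           # diagonal = positives
    return F.cross_entropy(sims, targets)

torch.manual_seed(0)
q = torch.randn(8, 768)              # stand-in query embeddings
p = q + 0.01 * torch.randn(8, 768)   # positives close to their queries
loss = in_batch_contrastive_loss(q, p)
print(float(loss))
```

Because every other document in the batch is reused as a negative, each gradient step contrasts a positive pair against many negatives at no extra encoding cost.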

Core Capabilities

  • Fine-grained scientific document similarity analysis
  • Document representation through single vector encoding
  • Specialized performance in computer science domain
  • Adaptable for classification tasks through fine-tuning
  • Efficient processing of title-abstract combinations
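
Once title-abstract pairs are encoded into single vectors, similarity assessment reduces to nearest-neighbor ranking over those vectors. A minimal sketch with cosine similarity follows; the embeddings here are random stand-ins for real model outputs, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def rank_by_similarity(query_vec, corpus_vecs):
    """Return corpus indices sorted by cosine similarity to the query."""
    q = F.normalize(query_vec, dim=-1)      # (hidden,)
    c = F.normalize(corpus_vecs, dim=-1)    # (num_docs, hidden)
    scores = c @ q                          # (num_docs,)
    return scores.argsort(descending=True)

torch.manual_seed(0)
corpus = torch.randn(100, 768)                 # stand-in document embeddings
query = corpus[42] + 0.05 * torch.randn(768)   # a query near document 42
order = rank_by_similarity(query, corpus)
print(int(order[0]))  # document 42 ranks first
```

Single-vector representations make this kind of retrieval cheap: corpus embeddings can be precomputed once and compared to any query with a single matrix-vector product.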

Frequently Asked Questions

Q: What makes this model unique?

This model's uniqueness lies in its training on co-citation data rather than direct citations, allowing it to capture more nuanced relationships between scientific documents. It also employs a scalar mix over layer-wise [CLS] representations, which can be important for performance on some datasets.

Q: What are the recommended use cases?

The model is best suited for computer science document similarity tasks, particularly when working with paper titles and abstracts. While primarily designed for similarity assessment, it can be fine-tuned for classification tasks. However, performance may vary when applied to domains outside computer science.
