aspire-biencoder-compsci-spec

Maintained By
allenai


  • Author: Allen AI
  • Paper: Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity
  • GitHub: Project Repository
  • Domain: Computer Science / Scientific Documents

What is aspire-biencoder-compsci-spec?

The aspire-biencoder-compsci-spec is a BERT-based bi-encoder designed for measuring similarity between scientific documents. Initialized from the SPECTER model, it represents each document as a single vector: the title and abstract are encoded together, and the [CLS] token from every encoder layer is combined through a learned scalar mix.
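
The scalar mix described above can be sketched as a learned, softmax-weighted sum of the [CLS] vector from each encoder layer. The module below is a minimal PyTorch illustration under stated assumptions: the layer count and hidden size match BERT-base, and the `ScalarMix` module and random inputs are illustrative stand-ins, not the released checkpoint's actual parameters.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Softmax-weighted combination of per-layer [CLS] vectors (illustrative)."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar per encoder layer, plus a global scale.
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_cls: torch.Tensor) -> torch.Tensor:
        # layer_cls: (num_layers, batch, hidden) stack of [CLS] vectors.
        norm_w = torch.softmax(self.weights, dim=0)
        mixed = (norm_w[:, None, None] * layer_cls).sum(dim=0)
        return self.gamma * mixed

# Illustrative shapes for a BERT-base encoder: 12 layers, hidden size 768.
mix = ScalarMix(num_layers=12)
cls_per_layer = torch.randn(12, 4, 768)  # [CLS] vectors for 4 documents
doc_vectors = mix(cls_per_layer)
print(doc_vectors.shape)  # torch.Size([4, 768])
```

In practice the per-layer [CLS] stack would come from the encoder's hidden states rather than random tensors; the single mixed vector is what serves as the document representation.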

Implementation Details

The model was trained on a dataset of 1.2 million computer science paper pairs using a contrastive learning approach. Training used the Adam optimizer with a 2e-5 learning rate, 1000 warm-up steps, and linear decay. The model leverages co-citation patterns in scientific literature: papers cited together in the same context are used as training pairs.

  • Training utilizes in-batch negative sampling for contrastive learning
  • Implements scalar mix parameters for enhanced performance
  • Focuses on title-abstract pair representations
  • Achieves superior performance compared to SPECTER baseline (37.17 vs 34.23 MAP on CSFCube)
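
The in-batch negative scheme above can be illustrated with a small contrastive loss: each query's co-cited positive is the target, and the other positives in the batch act as negatives. This is a generic sketch of that training signal, not the released training code; the function name and random embeddings are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_vecs, pos_vecs, temperature=1.0):
    """Cross-entropy over a batch similarity matrix.

    query_vecs, pos_vecs: (batch, hidden). Row i of pos_vecs is the
    co-cited positive for query i; all other rows serve as negatives.
    """
    sims = query_vecs @ pos_vecs.T / temperature   # (batch, batch)
    targets = torch.arange(sims.size(0))           # diagonal = positives
    return F.cross_entropy(sims, targets)

torch.manual_seed(0)
q = torch.randn(8, 768)              # stand-in query embeddings
p = q + 0.01 * torch.randn(8, 768)   # positives close to their queries
loss = in_batch_contrastive_loss(q, p)
print(float(loss))
```

Because every other document in the batch is reused as a negative, each gradient step contrasts a positive pair against many negatives at no extra encoding cost.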

Core Capabilities

  • Fine-grained scientific document similarity analysis
  • Document representation through single vector encoding
  • Specialized performance in computer science domain
  • Adaptable for classification tasks through fine-tuning
  • Efficient processing of title-abstract combinations
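
Once title-abstract pairs are encoded into single vectors, similarity assessment reduces to nearest-neighbor ranking over those vectors. A minimal sketch with cosine similarity follows; the embeddings here are random stand-ins for real model outputs, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def rank_by_similarity(query_vec, corpus_vecs):
    """Return corpus indices sorted by cosine similarity to the query."""
    q = F.normalize(query_vec, dim=-1)      # (hidden,)
    c = F.normalize(corpus_vecs, dim=-1)    # (num_docs, hidden)
    scores = c @ q                          # (num_docs,)
    return scores.argsort(descending=True)

torch.manual_seed(0)
corpus = torch.randn(100, 768)                 # stand-in document embeddings
query = corpus[42] + 0.05 * torch.randn(768)   # a query near document 42
order = rank_by_similarity(query, corpus)
print(int(order[0]))  # document 42 ranks first
```

Single-vector representations make this kind of retrieval cheap: corpus embeddings can be precomputed once and compared to any query with a single matrix-vector product.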

Frequently Asked Questions

Q: What makes this model unique?

This model's uniqueness lies in its training on co-citation data rather than direct citations, allowing it to capture more nuanced relationships between scientific documents. It also employs a scalar mix over layer-wise [CLS] representations, which can be important for performance on some datasets.

Q: What are the recommended use cases?

The model is best suited for computer science document similarity tasks, particularly when working with paper titles and abstracts. While primarily designed for similarity assessment, it can be fine-tuned for classification tasks. However, performance may vary when applied to domains outside computer science.
