BioLORD-2023-M
| Property | Value |
|---|---|
| Parameter Count | 278M |
| Supported Languages | English, Spanish, French, German, Dutch, Danish, Swedish |
| License | IHTSDO and NLM Licenses |
| Paper | BioLORD-2023 Paper |
| Base Architecture | XLM-RoBERTa |
What is BioLORD-2023-M?
BioLORD-2023-M is a state-of-the-art multilingual biomedical language model designed for producing meaningful representations of clinical sentences and biomedical concepts. Built on sentence-transformers architecture, it employs a novel pre-training strategy that grounds concept representations using definitions and knowledge graph descriptions.
Implementation Details
The model implements a three-phase training strategy: contrastive learning, definition-based training, and self-distillation. It maps sentences and paragraphs to a 768-dimensional dense vector space, making it particularly effective for clustering and semantic search in medical contexts.
- Built on the sentence-transformers architecture with a multilingual XLM-RoBERTa backbone
- Trained on BioLORD-Dataset and AGCT-Dataset
- Implements advanced knowledge graph integration
- Supports both sentence and phrase embeddings
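A minimal usage sketch for obtaining the 768-dimensional embeddings described above, via the sentence-transformers library (the `FremyCompany/BioLORD-2023-M` model ID and the helper function names are assumptions for illustration):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def embed_sentences(sentences: list[str]) -> np.ndarray:
    """Encode clinical sentences into 768-dimensional vectors.

    The import is kept local because loading the ~278M-parameter
    model downloads its weights on first use.
    """
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("FremyCompany/BioLORD-2023-M")
    return model.encode(sentences)  # shape: (len(sentences), 768)


# Example (not run here, requires the model download):
# vecs = embed_sentences([
#     "The patient presents with shortness of breath.",
#     "El paciente presenta dificultad para respirar.",
# ])
# print(cosine_similarity(vecs[0], vecs[1]))
```

Because the model maps semantically related sentences close together in the vector space, the cosine score between translations or paraphrases of the same clinical statement should be high.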
Core Capabilities
- Multilingual medical text similarity analysis
- Biomedical concept representation
- Clinical sentence embedding
- Cross-lingual medical information processing
- Semantic search in medical documents
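For the semantic-search use case, document embeddings can be precomputed once and each query ranked against them by cosine score. A sketch of that ranking step in NumPy (function and variable names are illustrative):

```python
import numpy as np


def rank_documents(query_vec: np.ndarray,
                   doc_vecs: np.ndarray,
                   top_k: int = 3) -> tuple[list[int], list[float]]:
    """Rank precomputed document embeddings against a query embedding.

    Returns the indices of the top_k most similar documents and their
    cosine scores, highest first.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:top_k]
    return order.tolist(), scores[order].tolist()


# In practice, query_vec and doc_vecs would be produced by the
# model's encode() method rather than constructed by hand.
```

Precomputing and caching `doc_vecs` keeps query-time cost to a single matrix-vector product, which is what makes this embedding-based search practical over large collections of clinical documents.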
Frequently Asked Questions
Q: What makes this model unique?
BioLORD-2023-M stands out for its innovative approach to grounding concept representations using definitions and knowledge graph information, resulting in representations that are more semantically and hierarchically aware than those of traditional models.
Q: What are the recommended use cases?
The model excels at processing medical documents, electronic health records (EHRs), and clinical notes, particularly for tasks requiring semantic understanding across multiple European languages. It is well suited to healthcare institutions that need multilingual capability in their NLP pipelines.