bsc-bio-ehr-es

Maintained By
PlanTL-GOB-ES

Property        Value
Developer       Text Mining Unit (TeMU) at Barcelona Supercomputing Center
License         Apache License 2.0
Training Data   1.1B tokens of biomedical text
Architecture    RoBERTa-based
Paper           Publication Link

What is bsc-bio-ehr-es?

bsc-bio-ehr-es is a Spanish language model for biomedical and clinical natural language processing. It was trained on a corpus of more than 1.1 billion tokens, including 95 million tokens from real electronic health records, and is presented as the first large-scale biomedical Spanish language model trained from scratch.

Implementation Details

The model uses a RoBERTa architecture with byte-level BPE tokenization and a 52,000-token vocabulary. Training ran for 48 hours on 16 NVIDIA V100 GPUs, using the Adam optimizer with a peak learning rate of 0.0005 and a batch size of 2,048 sentences.
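The reported figures imply a rough compute budget. The arithmetic below is a back-of-the-envelope sketch only: the even per-GPU split assumes plain data parallelism with no gradient accumulation, which the source does not state.

```python
# Back-of-the-envelope figures derived from the reported training setup.
NUM_GPUS = 16            # NVIDIA V100 GPUs (reported)
TRAINING_HOURS = 48      # wall-clock training time (reported)
BATCH_SENTENCES = 2048   # global batch size in sentences (reported)

# Total compute budget in GPU-hours.
gpu_hours = NUM_GPUS * TRAINING_HOURS

# Assumption: the global batch is split evenly across GPUs with no
# gradient accumulation. The source does not confirm this.
sentences_per_gpu = BATCH_SENTENCES // NUM_GPUS

print(f"GPU-hours: {gpu_hours}")                            # 768
print(f"Sentences per GPU per step: {sentences_per_gpu}")   # 128
```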

  • Trained on diverse medical sources including clinical documents, scientific publications, and medical patents
  • Incorporates both cleaned biomedical corpora and authentic clinical text
  • Utilizes advanced preprocessing techniques while preserving clinical language characteristics

Core Capabilities

  • Masked Language Modeling for Fill Mask tasks
  • Superior performance in Named Entity Recognition (NER) tasks
  • Achieves state-of-the-art results on PharmaCoNER (0.8913 F1), CANTEMIST (0.8340 F1), and ICTUSnet (0.8756 F1)
  • Specifically optimized for Spanish clinical text analysis
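The fill-mask capability listed above can be tried with the Hugging Face `transformers` pipeline. This is a minimal sketch, not an official usage recipe: the Hub ID is assumed from the maintainer and model name, `top_tokens` and `fill_mask` are hypothetical helper names, and the `<mask>` token follows the usual RoBERTa convention.

```python
from typing import Dict, List

# Assumption: Hub ID built from the maintainer and model name on this page.
MODEL_ID = "PlanTL-GOB-ES/bsc-bio-ehr-es"


def top_tokens(predictions: List[Dict], k: int = 3) -> List[str]:
    """Return the k highest-scoring token strings from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked[:k]]


def fill_mask(text: str) -> List[Dict]:
    """Run the fill-mask pipeline on text containing one <mask> token."""
    # Imported lazily so the helper above works without `transformers`.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=MODEL_ID)
    return unmasker(text)


# Example call (requires network access and the `transformers` package):
#   sentence = "El único antecedente personal a reseñar era la <mask> arterial."
#   print(top_tokens(fill_mask(sentence)))
```

Each prediction returned by the pipeline is a dictionary with `score`, `token`, `token_str`, and `sequence` keys; the helper only relies on `score` and `token_str`.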

Frequently Asked Questions

Q: What makes this model unique?

This is the first large-scale Spanish biomedical language model trained from scratch, combining both biomedical literature and real clinical documents. It consistently outperforms both general-domain and other domain-specific models in clinical NER tasks.

Q: What are the recommended use cases?

Out of the box, the model performs only masked language modeling (fill-mask). It is intended to be fine-tuned for downstream tasks such as Named Entity Recognition or Text Classification on Spanish medical text, and it is particularly suited to clinical documents and biomedical literature.
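As a starting point for NER fine-tuning, the sketch below attaches a token-classification head to the pretrained encoder. It assumes a BIO tagging scheme and the Hub ID shown; `build_bio_labels`, `load_for_ner`, and the entity types passed to them are hypothetical and should be replaced by those of the target dataset. The training loop itself is omitted.

```python
from typing import Dict, List, Tuple

# Assumption: Hub ID built from the maintainer and model name on this page.
MODEL_ID = "PlanTL-GOB-ES/bsc-bio-ehr-es"


def build_bio_labels(entity_types: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build id2label / label2id maps for a BIO tagging scheme."""
    labels = ["O"]
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_for_ner(entity_types: List[str]):
    """Load the tokenizer and the encoder with a new classification head."""
    # Imported lazily so the label helper works without `transformers`.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = build_bio_labels(entity_types)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # The classification head is randomly initialised and must be trained.
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Example call (requires network access and the `transformers` package):
#   tokenizer, model = load_for_ner(["DRUG"])  # "DRUG" is a placeholder type
```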
