bsc-bio-ehr-es

PlanTL-GOB-ES

Spanish biomedical and clinical language model pretrained on more than 1.1B tokens and intended for clinical NLP tasks. After fine-tuning, it reports state-of-the-art F1 scores on three Spanish clinical NER benchmarks.

Property	Value
Developer	Text Mining Unit (TeMU) at Barcelona Supercomputing Center
License	Apache License 2.0
Training Data	Over 1.1B tokens of biomedical and clinical text, including 95M tokens from electronic health records
Architecture	RoBERTa-based
Paper	Publication Link

What is bsc-bio-ehr-es?

bsc-bio-ehr-es is a Spanish language model for biomedical and clinical natural language processing. It was pretrained on a corpus of over 1.1 billion tokens, including 95M tokens from real electronic health records, and is described as the first large-scale Spanish biomedical language model trained from scratch.

Implementation Details

The model uses the RoBERTa architecture with a byte-level BPE tokenizer and a 52,000-token vocabulary. Pretraining took 48 hours on 16 NVIDIA V100 GPUs, using the Adam optimizer with a peak learning rate of 0.0005 and a batch size of 2,048 sentences.
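To illustrate what "byte-level" means for Spanish text, the sketch below shows only the first step of byte-level BPE: decomposing a string into UTF-8 bytes. It is not the model's tokenizer; the helper name `utf8_byte_ids` and the example word are chosen purely for illustration. Because every string reduces to byte values between 0 and 255, accented characters never fall outside the base alphabet, and the learned merges then combine frequent byte sequences into the larger vocabulary entries.

```python
from typing import List


def utf8_byte_ids(text: str) -> List[int]:
    """Return the UTF-8 byte values of `text` (the base symbols of byte-level BPE)."""
    return list(text.encode("utf-8"))


word = "neumon\u00eda"  # "neumonía", an illustrative Spanish clinical term
byte_ids = utf8_byte_ids(word)

# The accented "í" occupies two bytes, so 8 characters become 9 byte symbols.
print(len(word), len(byte_ids))

# The decomposition is lossless: decoding the bytes restores the original word.
print(bytes(byte_ids).decode("utf-8") == word)
```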

Trained on diverse medical sources including clinical documents, scientific publications, and medical patents
Incorporates both cleaned biomedical corpora and authentic clinical text
Applies cleaning to the biomedical corpora while keeping the clinical text close to its original form, preserving the characteristics of clinical language

Core Capabilities

Masked language modeling (the Fill Mask task) without further training
Strong Named Entity Recognition (NER) performance after fine-tuning
Reported state-of-the-art F1 scores on PharmaCoNER (0.8913), CANTEMIST (0.8340), and ICTUSnet (0.8756)
Pretrained specifically for Spanish clinical and biomedical text
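For readers unfamiliar with how NER F1 scores of this kind are computed, here is a minimal sketch of entity-level F1 under strict matching, where a prediction counts only if its span and type both match a gold entity exactly. This is an assumption for illustration: each benchmark's official scorer may apply different matching rules, and the tuple format, the function name `entity_f1`, and the labels `FARMACO` and `SINTOMA` are hypothetical.

```python
from typing import Iterable, Tuple

# (document id, start offset, end offset, entity type): a hypothetical span format
Entity = Tuple[str, int, int, str]


def entity_f1(gold: Iterable[Entity], pred: Iterable[Entity]) -> float:
    """Strict entity-level F1: a prediction is correct only on an exact span and type match."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [
    ("doc1", 0, 11, "FARMACO"),
    ("doc1", 20, 28, "SINTOMA"),
    ("doc2", 5, 14, "FARMACO"),
]
pred = [
    ("doc1", 0, 11, "FARMACO"),
    ("doc1", 20, 27, "SINTOMA"),  # end offset is off by one, so this is not a match
    ("doc2", 5, 14, "FARMACO"),
]

score = entity_f1(gold, pred)
print(f"{score:.4f}")  # 0.6667 (precision 2/3, recall 2/3)
```

Under strict matching, the near-miss on the second entity counts as both a false positive and a false negative, which is why boundary errors are costly in this kind of evaluation.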

Frequently Asked Questions

Q: What makes this model unique?

It is described as the first large-scale Spanish biomedical language model trained from scratch, and it combines biomedical literature with real clinical documents. In the reported evaluations, it outperforms both general-domain and other domain-specific models on clinical NER tasks.

Q: What are the recommended use cases?

Out of the box, the model supports only masked language modeling (the Fill Mask task). It is intended to be fine-tuned for downstream tasks such as Named Entity Recognition or Text Classification in Spanish medical contexts, and it is particularly suited to clinical documents and biomedical literature.
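Both use cases can be sketched as follows, assuming the Hugging Face `transformers` library is installed and that the model is published under the identifier `PlanTL-GOB-ES/bsc-bio-ehr-es` (composed here from the organization and model names on this page). The helper names, the `___` placeholder, and the `FARMACO` entity type are hypothetical. The two functions that touch the model import `transformers` lazily and download the weights on first call, so nothing is fetched until they are invoked.

```python
MODEL_ID = "PlanTL-GOB-ES/bsc-bio-ehr-es"  # assumed Hub identifier (organization/model)
BLANK = "___"  # hypothetical placeholder, swapped for the tokenizer's real mask token


def insert_mask(text, mask_token, blank=BLANK):
    """Replace the single blank in `text` with the model's mask token."""
    if text.count(blank) != 1:
        raise ValueError("expected exactly one blank to mask")
    return text.replace(blank, mask_token)


def bio_label_maps(entity_types):
    """Build BIO label/id mappings for a hypothetical NER tag set."""
    labels = ["O"]
    for entity_type in entity_types:
        labels += [f"B-{entity_type}", f"I-{entity_type}"]
    label2id = {label: index for index, label in enumerate(labels)}
    id2label = {index: label for label, index in label2id.items()}
    return label2id, id2label


def fill_mask(text_with_blank, top_k=5):
    """Return (token, score) candidates for the blank; downloads the model on first use."""
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=MODEL_ID)
    text = insert_mask(text_with_blank, unmasker.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in unmasker(text, top_k=top_k)]


def load_for_ner(entity_types):
    """Load the tokenizer and the encoder with a new token-classification head."""
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    label2id, id2label = bio_label_maps(entity_types)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Runs offline: shows the label mapping a fine-tuning run would start from.
print(bio_label_maps(["FARMACO"])[0])
```

A call such as `fill_mask("El paciente presenta ___ aguda.")` would return candidate tokens with their scores. The classification head returned by `load_for_ner` is newly initialized, so it has to be trained on labeled data before its predictions are meaningful.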