roberta-base-biomedical-clinical-es

Maintained By
PlanTL-GOB-ES

License: Apache 2.0
Language: Spanish
Training Data: 1B+ tokens

What is roberta-base-biomedical-clinical-es?

This is a specialized RoBERTa-based language model designed specifically for Spanish biomedical and clinical text processing. Developed by the Text Mining Unit at Barcelona Supercomputing Center, it has been trained on a massive corpus of over 1 billion tokens from various medical sources, including clinical documents, research papers, and medical crawl data.

Implementation Details

The model uses a byte-level BPE tokenizer with a 52,000-token vocabulary. Training ran on 16 NVIDIA V100 GPUs for 48 hours, using a masked language modeling objective with the Adam optimizer and a peak learning rate of 0.0005. The training corpus combines cleaned biomedical texts with uncleaned clinical notes to preserve authentic medical language patterns.

  • Trained on diverse medical sources including hospital discharge reports, clinical cases, and scientific publications
  • Implements masked language modeling for fill-mask tasks
  • Achieves state-of-the-art performance on Spanish medical NER tasks
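The tokenizer described above can be inspected directly. This is a minimal sketch assuming the transformers library is installed and the model can be downloaded from the Hugging Face Hub under the identifier "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"; the example sentence is illustrative.

```python
# Sketch: inspecting the model's byte-level BPE tokenizer
# (52,000-token vocabulary, per the model card).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"
)

# Tokenize a short clinical sentence into subword pieces.
pieces = tok.tokenize("El paciente presenta neumonía bilateral.")
print(tok.vocab_size)  # vocabulary size
print(pieces)          # subword pieces for the sentence
```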

Core Capabilities

  • Masked language modeling for medical text completion
  • Named Entity Recognition for medical terms (90.04% F1 score on PharmaCoNER)
  • Clinical text understanding and processing
  • Biomedical document analysis
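The fill-mask capability can be exercised through the standard transformers pipeline. A minimal sketch, assuming network access to the Hugging Face Hub; the masked sentence is a hypothetical example, and RoBERTa-style models use "<mask>" as the mask token.

```python
# Sketch: masked-token prediction on Spanish clinical text.
from transformers import pipeline

fill = pipeline(
    "fill-mask",
    model="PlanTL-GOB-ES/roberta-base-biomedical-clinical-es",
)

# Ask the model to complete a clinical-style sentence.
predictions = fill("El paciente fue diagnosticado con <mask> pulmonar.")
for p in predictions:
    print(f"{p['token_str']!r}  score={p['score']:.3f}")
```

Each prediction is a dict with the proposed token, its probability score, and the completed sequence.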

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its specialized focus on Spanish medical text and its comprehensive training data from real clinical documents. It outperforms general-purpose models like mBERT and BETO on medical NER tasks, showing significant improvements in understanding medical terminology and context.

Q: What are the recommended use cases?

The model is primarily designed for masked language modeling tasks in medical contexts but can be fine-tuned for various downstream tasks including named entity recognition, text classification, and medical document analysis. It's particularly suitable for processing Spanish clinical notes, medical research papers, and healthcare-related content.
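Fine-tuning for a downstream task such as NER starts from a token-classification head on top of the pretrained encoder. A sketch under assumptions: the label set below is an illustrative PharmaCoNER-style subset, not the official tag inventory, and the head must be trained on labeled data before it produces useful predictions.

```python
# Sketch: preparing the model for token-classification fine-tuning
# (e.g. medical NER). The labels here are hypothetical placeholders.
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"
labels = ["O", "B-ENTITY", "I-ENTITY"]  # illustrative tag set

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The classification head is randomly initialized at this point;
# fine-tune on annotated data (e.g. with transformers' Trainer)
# before using the model for prediction.
```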
