roberta-base-biomedical-clinical-es

PlanTL-GOB-ES

A Spanish biomedical RoBERTa model trained on more than 1 billion tokens of biomedical and clinical text, achieving state-of-the-art results on Spanish medical NER with a 90.04% F1 score on PharmaCoNER.

  • License: Apache 2.0
  • Language: Spanish
  • Paper: View Paper
  • Training Data: 1B+ tokens

What is roberta-base-biomedical-clinical-es?

This is a specialized RoBERTa-based language model designed specifically for Spanish biomedical and clinical text processing. Developed by the Text Mining Unit at Barcelona Supercomputing Center, it has been trained on a massive corpus of over 1 billion tokens from various medical sources, including clinical documents, research papers, and medical crawl data.

Implementation Details

The model uses a byte-level BPE tokenizer with a 52,000-token vocabulary. Training ran on 16 NVIDIA V100 GPUs for 48 hours, using the masked language modeling objective with the Adam optimizer and a peak learning rate of 0.0005. The training corpus combines cleaned biomedical texts with uncleaned clinical notes to preserve authentic medical language patterns.

  • Trained on diverse medical sources including hospital discharge reports, clinical cases, and scientific publications
  • Implements masked language modeling for fill-mask tasks
  • Achieves state-of-the-art performance on Spanish medical NER tasks
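The masked language modeling objective mentioned above can be illustrated with a short, self-contained sketch. This is a simplified word-level illustration of RoBERTa's 80/10/10 dynamic-masking rule, not the model's actual byte-level BPE pipeline; the function name and the tiny Spanish vocabulary are ours, for demonstration only:

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, vocab=None, seed=0):
    # RoBERTa-style dynamic masking: roughly 15% of positions are selected;
    # of those, 80% become <mask>, 10% a random token, 10% stay unchanged.
    rng = random.Random(seed)
    vocab = vocab or ["el", "la", "paciente", "dolor", "fiebre"]
    out, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                out.append(MASK)
            elif r < 0.9:
                out.append(rng.choice(vocab))
            else:
                out.append(tok)
        else:
            labels.append(None)  # position not scored by the loss
            out.append(tok)
    return out, labels

tokens = "el paciente presenta dolor abdominal y fiebre alta".split()
masked, labels = mask_tokens(tokens, seed=42)
```

Because masking is re-sampled per pass ("dynamic"), the same sentence yields different training targets across epochs.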

Core Capabilities

  • Masked language modeling for medical text completion
  • Named Entity Recognition for medical terms (90.04% F1 score on PharmaCoNER)
  • Clinical text understanding and processing
  • Biomedical document analysis
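For the fill-mask capability, usage via the Hugging Face transformers library looks roughly like the following. This is a sketch: it assumes `transformers` and a PyTorch backend are installed, the pipeline call downloads the checkpoint on first run, and the helper names `build_prompt` and `predict_masked` are ours, not part of any API:

```python
# Hugging Face model ID as published by PlanTL-GOB-ES.
MODEL_ID = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"

def build_prompt(sentence: str, mask_token: str = "<mask>") -> str:
    """Replace a [MASK] placeholder with the model's actual mask token."""
    return sentence.replace("[MASK]", mask_token)

def predict_masked(sentence: str, top_k: int = 5):
    # Imported lazily so build_prompt stays usable without transformers.
    from transformers import pipeline
    fill = pipeline("fill-mask", model=MODEL_ID)
    # Each result entry carries a candidate token and its score.
    return fill(build_prompt(sentence), top_k=top_k)

# Example call (requires transformers installed and network access):
# predict_masked("El paciente presenta [MASK] abdominal.")
```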

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its specialized focus on Spanish medical text and its comprehensive training data from real clinical documents. It outperforms general-purpose models like mBERT and BETO on medical NER tasks, showing significant improvements in understanding medical terminology and context.

Q: What are the recommended use cases?

The model is primarily designed for masked language modeling tasks in medical contexts but can be fine-tuned for various downstream tasks including named entity recognition, text classification, and medical document analysis. It's particularly suitable for processing Spanish clinical notes, medical research papers, and healthcare-related content.
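As a sketch of the NER fine-tuning path, the main practical step is aligning word-level entity labels with the model's subword tokens. The helpers below are illustrative, not an official recipe: the `-100` ignore index is the cross-entropy convention that transformers' token-classification models use, and `build_ner_model` assumes the transformers library is installed:

```python
MODEL_ID = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"

def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level NER labels onto subword tokens.

    word_ids has one entry per subword, giving the index of the source
    word (None for special tokens), as returned by a fast tokenizer's
    encoding.word_ids(). Only the first subword of each word keeps its
    label; the rest get ignore_index so the loss skips them.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned

def build_ner_model(num_labels: int):
    # Requires `transformers`; downloads the checkpoint on first use.
    from transformers import AutoModelForTokenClassification, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels
    )
    return tokenizer, model
```

For a two-word sentence split into subwords as `[<s>, w0, w0, w1, </s>]` with labels `[3, 7]`, `align_labels` yields `[-100, 3, -100, 7, -100]`.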
