deid_roberta_i2b2

Property	Value
License	MIT
Base Architecture	RoBERTa
Primary Paper	RoBERTa Paper
Task Type	Token Classification

What is deid_roberta_i2b2?

deid_roberta_i2b2 is a RoBERTa-based token classification model for de-identifying medical notes and electronic health records (EHRs). It identifies protected health information (PHI) and classifies it into 11 categories, using BILOU (Begin, Inside, Last, Outside, Unit) tagging to mark the boundaries of each entity.
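To make the BILOU scheme concrete, here is a minimal sketch that turns entity spans into per-token tags. The sentence, the span positions, and the helper name `bilou_tags` are illustrative assumptions, not part of the model's own code; only the tag prefixes and the STAFF and DATE labels come from the description above.

```python
from typing import List, Tuple

def bilou_tags(num_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans into BILOU tags."""
    tags = ["O"] * num_tokens              # Outside by default
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"     # Unit: single-token entity
        else:
            tags[start] = f"B-{label}"     # Begin
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"     # Inside
            tags[end - 1] = f"L-{label}"   # Last
    return tags

# Hypothetical example sentence, already split into tokens.
tokens = ["Seen", "by", "Dr.", "Jane", "Mary", "Doe", "on", "2014-03-02"]
spans = [(3, 6, "STAFF"), (7, 8, "DATE")]
tags = bilou_tags(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Compared with plain BIO tagging, the explicit Last and Unit tags tell the classifier where an entity ends, which is what allows exact span boundaries to be recovered.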

Implementation Details

The model is trained on the I2B2 2014 dataset. Medical notes pass through a preprocessing pipeline that segments sentences with en_core_sci_sm (a scispaCy model loaded through spaCy) and then applies custom tokenization. Each input sequence is limited to 128 tokens, with 32 tokens of context added on each side, drawn from the preceding and following sentences.
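The windowing step can be sketched as follows. This is an approximation under stated assumptions, not the original pipeline: it assumes the 128-token limit applies to the current sentence (longer sentences are split into chunks) and that 32 context tokens are taken on each side from the neighboring text. The function name `build_windows` and the dictionary keys are hypothetical.

```python
from typing import Dict, List

MAX_TOKENS = 128  # per-sequence limit stated in the text
CONTEXT = 32      # context tokens per side (assumed to apply to each side)

def build_windows(sentences: List[List[str]],
                  max_tokens: int = MAX_TOKENS,
                  context: int = CONTEXT) -> List[Dict[str, List[str]]]:
    """Pair each sentence (or chunk of a long sentence) with left/right context."""
    # Flatten the note so context can cross sentence boundaries.
    flat = [tok for sent in sentences for tok in sent]
    windows = []
    offset = 0  # index in `flat` where the current sentence begins
    for sent in sentences:
        # Sentences longer than max_tokens are split into consecutive chunks.
        for chunk_start in range(0, len(sent), max_tokens):
            start = offset + chunk_start
            end = min(start + max_tokens, offset + len(sent))
            windows.append({
                "left": flat[max(0, start - context):start],
                "current": flat[start:end],
                "right": flat[end:end + context],
            })
        offset += len(sent)
    return windows

# Toy note with three pre-tokenized sentences.
sentences = [
    [f"a{i}" for i in range(40)],
    [f"b{i}" for i in range(10)],
    [f"c{i}" for i in range(50)],
]
windows = build_windows(sentences)
```

In a setup like this, only the tokens in `current` would be labeled; the context tokens give the model surrounding text to condition on. Note that for chunks of an over-long sentence, the context in this sketch can come from the same sentence.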

Trained with the AdamW optimizer at a learning rate of 5e-5
Effective batch size of 32 (batches of 16 with gradient accumulation)
Dropout of 0.1 for regularization
Labels 11 PHI categories, including DATE, STAFF, HOSPITAL, and AGE
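The hyperparameters above can be collected in a small configuration object. This is a sketch only: the class and field names are hypothetical, and the accumulation count of 2 is an assumption inferred from 16 x 2 = 32, since the text does not state it.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class TrainConfig:
    optimizer: str = "AdamW"
    learning_rate: float = 5e-5
    per_step_batch_size: int = 16
    grad_accum_steps: int = 2        # assumption: 16 x 2 = 32
    dropout: float = 0.1
    num_phi_categories: int = 11

    @property
    def effective_batch_size(self) -> int:
        # Gradients from several small batches are summed before one update.
        return self.per_step_batch_size * self.grad_accum_steps

def optimizer_step_indices(num_micro_batches: int, accum: int) -> List[int]:
    """Indices of the micro-batches after which the optimizer would step."""
    return [i for i in range(num_micro_batches) if (i + 1) % accum == 0]

cfg = TrainConfig()
print(cfg.effective_batch_size)
print(optimizer_step_indices(6, cfg.grad_accum_steps))
```

Gradient accumulation trades wall-clock time for memory: each forward and backward pass holds only 16 sequences, while the weight update reflects all 32.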

Core Capabilities

Accurate identification of PHI entities in medical text
Coverage of 11 PHI categories relevant to HIPAA de-identification
Context-aware processing with a sliding-window approach
Robust performance across various medical document types
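Once each token carries a BILOU tag, de-identification reduces to decoding the tags into spans and replacing those spans. The sketch below shows one way to do that; the helper names, the placeholder format `[LABEL]`, and the example sentence are assumptions for illustration, not the model's own post-processing.

```python
from typing import List, Tuple

def bilou_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Decode BILOU tags into (start, end_exclusive, label) spans."""
    spans = []
    start = None  # start index of an entity opened by a B- tag
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "L" and start is not None:
            # Uses the label on the L- tag; B/L label agreement is not checked.
            spans.append((start, i + 1, label))
            start = None
        elif prefix == "O":
            start = None  # drop an entity that was opened but never closed
    return spans

def redact(tokens: List[str], tags: List[str]) -> str:
    """Replace every decoded PHI span with a [LABEL] placeholder."""
    out = list(tokens)
    # Work right to left so earlier indices stay valid after each replacement.
    for start, end, label in reversed(bilou_to_spans(tags)):
        out[start:end] = [f"[{label}]"]
    return " ".join(out)

# Hypothetical tokens and predicted tags.
tokens = ["Seen", "by", "Dr.", "Jane", "Mary", "Doe", "on", "2014-03-02"]
tags = ["O", "O", "O", "B-STAFF", "I-STAFF", "L-STAFF", "O", "U-DATE"]
redacted = redact(tokens, tags)
print(redacted)
```

Placeholder replacement is only one option; other de-identification workflows substitute realistic surrogate values instead, which this sketch does not attempt.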

Frequently Asked Questions

Q: What makes this model unique?

The model combines specialization in medical text de-identification with context-aware processing and coverage of 11 PHI categories, which suits it to healthcare applications. It has been downloaded more than 866,000 times.

Q: What are the recommended use cases?

This model is suited to healthcare organizations that need to de-identify medical records, research institutions processing clinical data, and any other organization that handles protected health information and must comply with HIPAA regulations.