deid_roberta_i2b2
| Property | Value |
|---|---|
| License | MIT |
| Base Architecture | RoBERTa |
| Primary Paper | RoBERTa Paper |
| Task Type | Token Classification |
What is deid_roberta_i2b2?
deid_roberta_i2b2 is a RoBERTa-based token-classification model for de-identifying medical notes and electronic health records (EHR). It identifies and classifies protected health information (PHI) across 11 categories, using BILOU (Begin, Inside, Last, Outside, Unit) tagging to mark where each entity starts and ends.
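To make the BILOU scheme concrete, here is a minimal sketch of how per-token BILOU tags can be collapsed into entity spans. The tag spelling (`B-STAFF`, `U-DATE`, and so on) and the helper name `bilou_to_spans` are assumptions for illustration, not taken from the model's own code.

```python
def bilou_to_spans(tags):
    """Convert a list of BILOU tags into (start, end_exclusive, label) token spans.

    Assumes tags look like "B-DATE", "I-DATE", "L-DATE", "U-DATE", or "O".
    Malformed sequences (for example an "L-" with no matching "B-") are dropped.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "U":
            # Unit: a single-token entity.
            spans.append((i, i + 1, name))
            start, label = None, None
        elif prefix == "B":
            # Begin: open a multi-token entity.
            start, label = i, name
        elif prefix == "I":
            # Inside: only valid if it continues the open entity.
            if start is None or name != label:
                start, label = None, None
        elif prefix == "L":
            # Last: close the open entity if the labels match.
            if start is not None and name == label:
                spans.append((start, i + 1, name))
            start, label = None, None
        else:
            # Outside: not part of any entity.
            start, label = None, None
    return spans


tokens = ["Seen", "by", "Dr.", "John", "Smith", "on", "01/02/2014"]
tags = ["O", "O", "O", "B-STAFF", "L-STAFF", "O", "U-DATE"]
spans = bilou_to_spans(tags)
```

Here `spans` holds one two-token STAFF span and one single-token DATE span. The explicit Last and Unit tags are what let adjacent entities of the same type be separated without guessing.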
Implementation Details
The model is trained on the I2B2 2014 dataset. Medical notes are split into sentences using the en_core_sci_sm model (a scispaCy model loaded through spaCy) and then passed through a custom tokenizer. Each input sequence is limited to 128 tokens, with 32 tokens of context added from both the preceding and following sentences.
- Training uses the AdamW optimizer with a learning rate of 5e-5
- Batch size of 32 (16 with gradient accumulation)
- Dropout of 0.1 for regularization
- Covers 11 PHI categories, including DATE, STAFF, HOSPITAL, and AGE
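The hyperparameters above can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical, and it assumes that "32 (16 with gradient accumulation)" means an effective batch of 32 built from per-step batches of 16.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the reported training settings."""
    optimizer: str = "AdamW"
    learning_rate: float = 5e-5
    effective_batch_size: int = 32   # assumed: batch size after accumulation
    per_step_batch_size: int = 16    # assumed: batch size per forward/backward pass
    dropout: float = 0.1
    max_seq_len: int = 128           # tokens per input sequence
    context_tokens: int = 32         # context tokens added on each side

    @property
    def grad_accum_steps(self) -> int:
        # Number of per-step batches accumulated before each optimizer update.
        return self.effective_batch_size // self.per_step_batch_size


config = TrainConfig()
accum_steps = config.grad_accum_steps
```

Under that reading, the optimizer would update once every two batches of 16. In PyTorch, the learning rate would typically be passed to `torch.optim.AdamW` via its `lr` argument.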
Core Capabilities
- Accurate identification of PHI entities in medical text
- Coverage of 11 PHI categories relevant to HIPAA de-identification
- Context-aware processing with sliding window approach
- Intended for a range of clinical document types, with training data drawn from the I2B2 2014 notes
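The context-aware windowing (128-token sequences with 32 tokens of context on each side, as described under Implementation Details) could be sketched as follows. The function name `build_windows` is hypothetical, and splitting long sentences into 64-token pieces so that focus plus context never exceeds 128 is an assumption, not the documented behavior.

```python
def build_windows(sentences, max_len=128, context=32):
    """Return (left_context, focus, right_context) token lists.

    `sentences` is a list of token lists. Each sentence is the focus of a
    window; up to `context` tokens from the surrounding text are attached on
    each side. Sentences longer than max_len - 2 * context tokens are split
    into several focus chunks (an assumption made for this sketch).
    """
    flat = [tok for sent in sentences for tok in sent]
    focus_len = max_len - 2 * context
    windows = []
    offset = 0  # position of the current sentence's first token in `flat`
    for sent in sentences:
        sent_end = offset + len(sent)
        for i in range(0, len(sent), focus_len):
            begin = offset + i
            end = min(sent_end, begin + focus_len)
            left = flat[max(0, begin - context):begin]
            right = flat[end:end + context]
            windows.append((left, flat[begin:end], right))
        offset = sent_end
    return windows


sentences = [
    [f"a{i}" for i in range(10)],
    [f"b{i}" for i in range(100)],
    [f"c{i}" for i in range(50)],
]
windows = build_windows(sentences)
```

Only the focus tokens would receive predictions in such a scheme; the context tokens exist to give the model surrounding text and would be discarded when labels are collected.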
Frequently Asked Questions
Q: What makes this model unique?
The model is specialized for de-identifying medical text, processes each sentence together with its surrounding context, and covers 11 PHI categories, which makes it a good fit for healthcare applications. It has been downloaded more than 866,000 times.
Q: What are the recommended use cases?
This model is suited to healthcare organizations that need to de-identify medical records, research institutions processing clinical data, and other entities handling protected health information that must comply with HIPAA regulations.
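For these use cases, a de-identification step might look like the sketch below. Several things here are assumptions rather than facts from this page: the Hub id `obi/deid_roberta_i2b2`, the `B-`/`I-`/`L-`/`U-` label spelling, and the helper names `redact` and `predict_spans`. The model-loading function uses the Hugging Face `transformers` token-classification pipeline and is defined but not called, so the redaction logic can be tried without downloading anything.

```python
def redact(text, spans):
    """Replace each (start, end, label) character span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip spans that overlap one already redacted
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def predict_spans(text, model_id="obi/deid_roberta_i2b2"):
    """Sketch: run the model and merge token predictions into character spans.

    Assumes the `transformers` package, a fast tokenizer (so that character
    offsets are returned), and labels of the form "B-DATE" / "U-DATE".
    """
    from transformers import pipeline  # imported lazily; optional dependency

    tagger = pipeline("token-classification", model=model_id)
    spans = []
    for tok in tagger(text):
        prefix, _, label = tok["entity"].partition("-")
        if prefix in ("B", "U") or not spans or spans[-1][2] != label:
            spans.append((tok["start"], tok["end"], label))  # open a new span
        else:
            spans[-1] = (spans[-1][0], tok["end"], label)    # extend the open span
    return spans


note = "Seen by Dr. Smith on 01/02/2014."
# Hand-written spans standing in for model output.
example_spans = [(12, 17, "STAFF"), (21, 31, "DATE")]
redacted = redact(note, example_spans)
```

With the hand-written spans, `redacted` reads "Seen by Dr. [STAFF] on [DATE]." Any automated output should still be reviewed before records are treated as de-identified.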