stanford-deidentifier-base
Property | Value |
---|---|
License | MIT |
Framework | PyTorch, Transformers |
Domain | Radiology, Biomedical |
Paper | View Research Paper |
What is stanford-deidentifier-base?
stanford-deidentifier-base is a token-classification model that detects protected health information (PHI) in medical documents, particularly radiology reports, so that it can be removed or replaced. Developed by StanfordAIMI, it reports F1 scores between 97.9 and 99.6 across four test sets, which makes it a candidate for production use.
Implementation Details
The model is built on PubMedBERT (uncased), a transformer encoder, and pairs it with rule-based methods for de-identification. It was trained on 6,193 documents from multiple institutions, including chest X-ray reports, CT reports, and medical notes.
- Built on PubMedBERT architecture
- Trained on multi-institutional dataset
- Implements token classification for PHI detection
- Includes synthetic PHI generation capabilities
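The token-classification step can be sketched with the Transformers `pipeline` API. This is a minimal sketch, not the authors' released code: the Hub id `StanfordAIMI/stanford-deidentifier-base` is assumed from the developer and model names above, the helper names `load_phi_tagger`, `mask_phi`, and `deidentify` are hypothetical, and the label names the model emits are not listed here, so the code simply echoes whatever label each span carries.

```python
from typing import Dict, List

# Assumed Hugging Face Hub id, inferred from the developer and model names.
MODEL_ID = "StanfordAIMI/stanford-deidentifier-base"


def load_phi_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads weights on first use)."""
    # Imported lazily so mask_phi stays usable without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entity spans,
    # each a dict with "entity_group", "score", "word", "start", and "end".
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def mask_phi(text: str, entities: List[Dict]) -> str:
    """Replace each predicted span with a [LABEL] placeholder."""
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


def deidentify(report: str) -> str:
    """Hypothetical convenience wrapper: tag PHI spans, then mask them."""
    tagger = load_phi_tagger()
    return mask_phi(report, tagger(report))
```

Calling `deidentify(report_text)` would download the weights on first use and return the report with each detected span replaced by its label in brackets. `mask_phi` can also be applied to spans from any other tagger that returns character offsets.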
Core Capabilities
- F1 score of 97.9 on reports from a known institution
- F1 score of 99.6 on reports from a new institution
- F1 score of 99.5 on the i2b2 2006 dataset
- F1 score of 98.9 on the i2b2 2014 dataset
- Automatic replacement of PHI with realistic surrogates
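For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it at the token level on a 0 to 100 scale. This page does not say whether the reported scores are token-level or span-level, or how they are averaged, so the token-level choice, the `"O"` outside label, and the function name `phi_f1` are all assumptions made for illustration.

```python
from typing import Sequence


def phi_f1(gold: Sequence[str], pred: Sequence[str], outside: str = "O") -> float:
    """Token-level F1 (0 to 100) over PHI labels, ignoring the outside label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    pairs = list(zip(gold, pred))
    # True positive: a PHI token predicted with the correct label.
    tp = sum(1 for g, p in pairs if g != outside and g == p)
    # False positive: a PHI label predicted where gold disagrees.
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    # False negative: a gold PHI token that was missed or mislabeled.
    fn = sum(1 for g, p in pairs if g != outside and g != p)

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100 * 2 * precision * recall / (precision + recall)
```

In de-identification, recall usually matters most, since a false negative is PHI left in the document, so it is worth inspecting precision and recall separately as well as the combined score.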
Frequently Asked Questions
Q: What makes this model unique?
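This model combines transformer-based detection with rule-based "hide in plain sight" replacement: detected PHI is swapped for realistic surrogates, so any PHI the detector misses is hard to tell apart from the synthetic values around it. The combination is reported to reach state-of-the-art performance, exceeding both existing tools and human labelers on i2b2 2014 data.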
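The "hide in plain sight" idea can be illustrated with a short sketch. It is not the released implementation: the surrogate pools, the label names `NAME` and `DATE`, and the function `hide_in_plain_sight` are hypothetical, and the span dictionaries simply follow the `entity_group`/`start`/`end` shape that Transformers token-classification pipelines return when spans are aggregated.

```python
import random
from typing import Dict, List, Optional

# Hypothetical surrogate pools keyed by hypothetical label names.
SURROGATES = {
    "NAME": ["Alex Morgan", "Jamie Lee", "Robin Chen"],
    "DATE": ["03/14/2015", "11/02/2018", "07/29/2012"],
}


def hide_in_plain_sight(
    text: str,
    entities: List[Dict],
    rng: Optional[random.Random] = None,
) -> str:
    """Swap each detected PHI span for a surrogate of the same type."""
    rng = rng or random.Random()
    # Replace from the end of the text so earlier offsets remain valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        pool = SURROGATES.get(ent["entity_group"])
        # Fall back to a generic tag when no pool exists for the label.
        surrogate = rng.choice(pool) if pool else "[REDACTED]"
        text = text[: ent["start"]] + surrogate + text[ent["end"]:]
    return text
```

Passing a seeded `random.Random` makes the output reproducible, which helps in tests. A production system would likely also keep surrogates consistent within a document, for example mapping every mention of the same patient to the same replacement name.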
Q: What are the recommended use cases?
The model is designed for de-identifying radiology reports and other medical documents in production settings where high accuracy matters. It suits healthcare institutions that need to process large volumes of documents while protecting patient privacy.