stanford-deidentifier-base

Property	Value
License	MIT
Framework	PyTorch, Transformers
Domain	Radiology, Biomedical
Paper	View Research Paper

What is stanford-deidentifier-base?

Stanford-deidentifier-base is a model developed by StanfordAIMI that automatically detects and removes protected health information (PHI) from medical documents, particularly radiology reports. It reports F1 scores of 97.9 or higher on each of its four test sets, which makes it a candidate for production de-identification pipelines.

Implementation Details

The model is a transformer built on PubMedBERT (uncased), combined with rule-based methods for de-identification. It was trained on 6,193 documents from multiple institutions, including chest X-ray reports, CT reports, and medical notes.

Built on PubMedBERT architecture
Trained on multi-institutional dataset
Implements token classification for PHI detection
Includes synthetic PHI generation capabilities
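
As a rough illustration of how token classification can drive PHI removal, the sketch below turns entity spans into a redacted string. It assumes the model is loaded through the Hugging Face `pipeline` API under the ID `StanfordAIMI/stanford-deidentifier-base`; the helper names, the sample sentence, and the `PATIENT` and `DATE` labels are illustrative assumptions, and the spans are written by hand so the example runs without downloading weights.

```python
from typing import Dict, List


def load_deidentifier(model_id: str = "StanfordAIMI/stanford-deidentifier-base"):
    """Build a token-classification pipeline (downloads weights on first use)."""
    # Imported lazily so the rest of this sketch runs without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a bracketed label placeholder."""
    out = text
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        out = out[: ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"]:]
    return out


def span(text: str, phrase: str, label: str) -> Dict:
    """Hand-build a span in the same shape the aggregated pipeline returns."""
    start = text.index(phrase)
    return {"entity_group": label, "start": start, "end": start + len(phrase)}


report = "Patient John Smith seen on 01/02/2020."
# Hypothetical labels; check the model's own config for the real label set.
entities = [span(report, "John Smith", "PATIENT"), span(report, "01/02/2020", "DATE")]
redacted = redact(report, entities)
print(redacted)
```

With a real pipeline, `entities` would instead come from `load_deidentifier()(report)`, which returns dictionaries carrying the same `entity_group`, `start`, and `end` keys.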

Core Capabilities

97.9 F1 score on known-institution reports
99.6 F1 score on new-institution reports
99.5 F1 score on the i2b2 2006 dataset
98.9 F1 score on the i2b2 2014 dataset
Automatic replacement of PHI with realistic surrogates
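
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from true-positive, false-positive, and false-negative counts. The counts are made up for illustration and are not taken from the model's evaluation, and the source does not say whether the reported scores are computed per token or per entity.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts, guarding against zero division."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only: 90 PHI items found correctly, 10 false alarms, 5 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=5)
print(f"precision={p:.3f} recall={r:.3f} F1={f1 * 100:.1f}")
```

In de-identification, recall is usually the more sensitive of the two, since every false negative is a piece of PHI left in the document.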

Frequently Asked Questions

Q: What makes this model unique?

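This model combines transformer-based PHI detection with a rule-based "hide in plain sight" step, which replaces detected PHI with realistic surrogates so that any PHI the detector misses is harder to distinguish from the synthetic values. It is reported to achieve state-of-the-art performance, exceeding both existing tools and human labelers on i2b2 2014 data.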
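
The "hide in plain sight" idea can be sketched as surrogate substitution: each detected span is swapped for a plausible fake value of the same type rather than a visible mask. This is a minimal sketch and not the authors' implementation; the surrogate pools, label names, and function name are hypothetical.

```python
import random
from typing import Dict, List

# Hypothetical surrogate pools; a real system would draw from much larger lists.
SURROGATES = {
    "PATIENT": ["Alex Morgan", "Jamie Lee", "Sam Carter"],
    "DATE": ["03/14/2017", "11/02/2019", "07/21/2018"],
}


def replace_with_surrogates(text: str, entities: List[Dict], seed: int = 0) -> str:
    """Swap each detected PHI span for a random surrogate of the same label."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    out = text
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        pool = SURROGATES.get(ent["entity_group"])
        # Fall back to a neutral placeholder for labels without a pool.
        surrogate = rng.choice(pool) if pool else "[REDACTED]"
        out = out[: ent["start"]] + surrogate + out[ent["end"]:]
    return out


note = "Patient John Smith seen on 01/02/2020."
name_start = note.index("John Smith")
date_start = note.index("01/02/2020")
detected = [
    {"entity_group": "PATIENT", "start": name_start, "end": name_start + len("John Smith")},
    {"entity_group": "DATE", "start": date_start, "end": date_start + len("01/02/2020")},
]
surrogate_note = replace_with_surrogates(note, detected)
print(surrogate_note)
```

Because the output still reads like an ordinary report, a reader cannot easily tell a surrogate from a real value that slipped past the detector.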

Q: What are the recommended use cases?

The model is designed for de-identifying radiology reports and other medical documents in production environments where high accuracy is crucial. It suits healthcare institutions that need to process large volumes of medical documents while protecting patient privacy.