Watch the Webinar

Accurate de-identification, obfuscation, and editing of scanned medical documents and images

One kind of noisy data that healthcare data scientists deal with is scanned documents and images: from PDF attachments of lab results, referrals, or genetic testing to DICOM files with medical imaging. These files are challenging to de-identify because protected health information (PHI) can appear anywhere in free text, so it cannot be removed with rules or regular expressions, or it can be “burned” into images so that it is not even available as digital text to begin with.

This webinar presents a software system that tackles these challenges, with lessons learned from applying it in real-world production systems. The workflow uses:

  • Spark OCR to extract both digital and scanned text from PDF and DICOM files
  • Spark NLP for Healthcare to recognize sensitive data in the extracted free text
  • The de-identification module to delete, replace, or obfuscate PHI
  • Spark OCR to generate a new PDF or DICOM file with the de-identified data
  • A local secure environment in which the whole workflow runs, with no need to share data with any third party or a public cloud API
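
To make the difference between deleting, replacing, and obfuscating PHI concrete, here is a minimal, library-free Python sketch of the middle two steps of the workflow. It is not the Spark OCR or Spark NLP for Healthcare API: the names `Entity`, `recognize_phi`, and `deidentify`, the toy regular-expression patterns, and the surrogate values are all hypothetical and chosen only for illustration. In particular, the regex recognizer is a stand-in for a trained entity recognition model; as noted above, rules alone are not enough for real free text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A detected PHI span: character offsets plus an entity label."""
    start: int
    end: int
    label: str


# ASSUMPTION: toy patterns standing in for a trained PHI recognizer.
_TOY_PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "ID": re.compile(r"\bMRN\d+\b"),
}

# ASSUMPTION: fixed surrogate values; a real obfuscator would generate
# realistic, varied replacements.
_SURROGATES = {"DATE": "1970-01-01", "ID": "MRN0000000"}


def recognize_phi(text):
    """Return non-overlapping PHI spans found in text, ordered by position."""
    entities = [
        Entity(match.start(), match.end(), label)
        for label, pattern in _TOY_PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(entities, key=lambda e: e.start)


def deidentify(text, entities, mode="replace"):
    """Rewrite text so that each entity span is deleted, masked, or obfuscated.

    mode="delete"    removes the span entirely
    mode="replace"   substitutes a placeholder such as <DATE>
    mode="obfuscate" substitutes a surrogate value of the same type
    """
    if mode not in {"delete", "replace", "obfuscate"}:
        raise ValueError(f"unknown mode: {mode}")
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        pieces.append(text[cursor:ent.start])  # keep text before the span
        if mode == "replace":
            pieces.append(f"<{ent.label}>")
        elif mode == "obfuscate":
            pieces.append(_SURROGATES[ent.label])
        # mode == "delete": append nothing for the span
        cursor = ent.end
    pieces.append(text[cursor:])  # keep text after the last span
    return "".join(pieces)


if __name__ == "__main__":
    note = "Patient MRN12345 seen on 2021-03-04."
    spans = recognize_phi(note)
    for mode in ("delete", "replace", "obfuscate"):
        print(mode, "->", deidentify(note, spans, mode))
```

Working on character offsets rather than on the matched strings keeps the recognizer and the de-identifier independent, so the toy recognizer could be swapped for a model-based one without touching `deidentify`.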

Dr. Alina Petukhova

Data Scientist at John Snow Labs