ODSC East 2020

Automating Clinical Data Abstraction From Unstructured Documents Using Spark NLP



Building databases that track real patients’ stories over time is essential for medical research, drug development, clinical quality improvement, population health, and chronic disease management. Doing this well traditionally presents three key challenges. First, much of the relevant information, such as patient demographics, comorbidities, history, and social determinants of health, is available only in free-text documents and notes. Second, there are gaps and conflicts between different data points about each patient that must be resolved. Third, most analyses are useful only with a large number of both patients and variables, which in turn means that building these databases manually is often impractical.
This session describes these challenges in the context of real-world projects and use cases. We’ll then cover how recent advances in natural language processing (NLP) and transfer learning have raised the accuracy and scale that can be achieved. We’ll share results and benchmarks from applying these techniques with Spark NLP for Healthcare, as well as best practices and lessons learned from early adopters of the technology.
About the speaker


David Talby is CTO at John Snow Labs, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams.

Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe. He also worked at Amazon in both Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a Ph.D. in computer science and master’s degrees in both computer science and business administration.