Accurate Table Extraction from Documents & Images with Spark OCR

Extracting data formatted as a table (tabular data) is a common task — whether you’re analyzing financial statements, academic research papers, or clinical trial documentation. Table-based information varies heavily in appearance, fonts, borders, and layouts. This makes the data extraction task challenging even when the text is searchable – but more so when the table is only available as an image.

This webinar presents how Spark OCR automatically extracts tabular data from images. This end-to-end solution includes computer vision models for table detection and table structure recognition, as well as OCR models for extracting text & numbers from each cell. The implemented approach provides state-of-the-art accuracy for the ICDAR 2013 and TableBank benchmark datasets.

Mykola Melnyk

Mykola Melnyk is a senior Scala, Python, and Spark software engineer with 15 years of industry experience. He has led teams and projects building machine learning and big data solutions in a variety of industries - and is currently the lead developer of the Spark OCR library at John Snow Labs.

Webinar sign up

Accurate Table Extraction from Documents & Images with Spark OCR

Watch live on August 11th @ 2:00 p.m. ET

Mykola Melnyk

Recent blog posts

Get to work

Let’s Talk!

Join our newsletter