Spark NLP for Data Scientists

Live Online Certification Training
April 22 | 12 - 4 PM EDT

Register Now


Natural Language Processing (NLP) is a key component in many data science systems that must understand or reason about text. Common use cases include knowledge extraction, question answering, entity recognition, spell correction, sentiment analysis, and document classification.

This four-hour workshop walks you through state-of-the-art NLP using John Snow Labs’ open-source Spark NLP library. It is a hands-on workshop in which you will write, edit, and run live Python notebooks that cover the majority of the open-source library’s functionality.

The workshop is organized in three hour-long sessions, each followed by 30 minutes of self-guided coding on Python notebooks relevant to that session. This is a live online workshop, and the instructors will be available during the self-guided coding periods to answer questions.

This workshop is part of the recommended preparation for the “Certified Spark NLP Data Scientist” certification exam.


Part I (90 minutes): Overview, Core Concepts, and Pre-Trained NLP Pipelines

  • Introduction to Spark NLP
  • Architecture and design goals
  • Core concepts: Pipelines, Annotators, Resources
  • Getting things done with pre-trained pipelines
  • Working with words: Sentence boundary detection, tokenization, stemming, lemmatization
  • Cleaning text: Normalization, stop-word remover, spell checking
  • Connecting words: Part of speech tagging, chunking, N-gram generator
  • Finding in text: Text matcher, date matcher, regex matcher, dependency parser

Part II (90 minutes): Custom NLP Pipelines and Named Entity Recognition

  • Building & configuring your own NLP pipeline
  • Understanding the Pipeline API and fit(), annotate(), and transform()
  • Named entity recognition (NER)
  • Using pre-trained NER models
  • Training your own NER model
  • Understanding and choosing embeddings
  • Using word, sentence, document, and universal embeddings

Part III (60 minutes): Document Classification and Inference

  • Understanding document classification use cases
  • Sentiment analysis annotators & models
  • Training your own document classifier
  • Integrating with other machine learning frameworks
  • Saving, loading, and sharing NLP models
  • Using LightPipeline for low-latency inference


Prerequisites:

  • A working knowledge of Python
  • Familiarity with the basics of machine learning, deep learning, and Apache Spark


Before the workshop:

  • A laptop with the tutorial environment installed
  • Complete the setup instructions (to be emailed before the workshop)


By the end of this workshop, you will:

  • Gain hands-on experience building complete NLP pipelines in Python
  • Understand the different features and tasks that NLP pipelines include
  • Know which pre-trained models are available with Spark NLP and how to use them
  • Understand when and how to train your own NLP models
  • Understand how to apply state-of-the-art deep learning, transfer learning, and transformers in day-to-day NLP use





John Snow Labs is an award-winning AI and NLP company, accelerating progress in data science by providing state-of-the-art models, data, and platforms. Founded in 2015, the company helps healthcare and life sciences companies – including Roche, Kaiser Permanente, Intel, and UCB – build, scale, deploy, and operate AI products and services. The company won CIO Review’s AI Solution Provider of the Year in 2018 and CIO Applications’ AI Platform of the Year in 2019. It also won the Strata Data Award in 2019 for delivering Spark NLP – the world’s most widely used NLP library in the enterprise. John Snow Labs is a global team of data specialists – a third of the team has a Ph.D. or M.D. degree and 75% hold at least a master’s degree in disciplines covering data science, medicine, data engineering, pharma, data security, and DataOps.

