Scaling Data Pipelines with Spark NLP Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • Familiarity with Apache Spark
  • Python programming experience

Audience

  • Data scientists
  • Developers

Overview

Spark NLP is an open-source library, built on Apache Spark, for natural language processing in Python, Java, and Scala. It is widely used in enterprise settings and industry verticals such as healthcare, finance, life sciences, and recruiting.

This instructor-led, live training (online or onsite) is aimed at data scientists and developers who wish to use Spark NLP to develop, implement, and scale natural language processing models and pipelines.

By the end of this training, participants will be able to:

  • Set up the necessary development environment to start building NLP pipelines with Spark NLP.
  • Understand the features, architecture, and benefits of using Spark NLP.
  • Use the pre-trained models available in Spark NLP to implement text processing.
  • Build, train, and scale Spark NLP models for production-grade projects.
  • Apply classification, inference, and sentiment analysis to real-world use cases (such as clinical data and customer behavior insights).

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request customized training for this course, please contact us.

Course Outline

Introduction

  • Spark NLP vs NLTK vs spaCy
  • Overview of Spark NLP features and architecture

Getting Started

  • Setup requirements
  • Installing Spark NLP
  • General concepts

Using Pre-trained Pipelines

  • Importing required modules
  • Default annotators
  • Loading a pipeline model
  • Transforming texts
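
The steps in this module (import, load, transform) might look roughly like the sketch below. It assumes the `spark-nlp` and `pyspark` packages are installed and that the pretrained pipeline `explain_document_dl` can be downloaded. The helper names `annotate_text` and `pair_tokens_with_tags` are hypothetical and exist only for illustration. The output keys `token` and `ner` are an assumption about that particular pipeline, and other pipelines may expose different columns.

```python
def annotate_text(text, pipeline_name="explain_document_dl", lang="en"):
    """Load a pretrained pipeline and annotate a single string."""
    # Imports live inside the function so this file still loads on a
    # machine where Spark NLP has not been installed yet.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    sparknlp.start()  # start (or reuse) a Spark session configured for Spark NLP
    pipeline = PretrainedPipeline(pipeline_name, lang=lang)  # downloaded on first use
    return pipeline.annotate(text)  # dict: output column name -> list of strings


def pair_tokens_with_tags(annotations, token_key="token", tag_key="ner"):
    """Zip tokens with their tags from an annotate() result.

    The key names are assumptions; missing keys yield an empty list.
    """
    tokens = annotations.get(token_key, [])
    tags = annotations.get(tag_key, [])
    return list(zip(tokens, tags))
```

With Spark NLP available, a call such as `pair_tokens_with_tags(annotate_text("Some sentence."))` would list each token next to its tag. The second helper is plain Python and can be tried on a hand-written dictionary without Spark.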

Building NLP Pipelines

  • Understanding the pipeline API
  • Implementing NER models
  • Choosing embeddings
  • Using word, sentence, and universal embeddings
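
As an illustration of the pipeline API covered in this module, the sketch below chains annotators into a Spark ML `Pipeline` for named entity recognition. It is a minimal example under stated assumptions, not the course's reference solution: the pretrained model names `glove_100d` and `ner_dl` are assumed to be available for download, and `WIRING`, `missing_inputs`, and `build_ner_pipeline` are hypothetical names. The key idea is that each annotator reads columns produced by earlier stages, which the plain-Python check makes explicit.

```python
# (output column, input columns) for each stage, in pipeline order.
WIRING = [
    ("document", ["text"]),
    ("sentence", ["document"]),
    ("token", ["sentence"]),
    ("embeddings", ["sentence", "token"]),
    ("ner", ["sentence", "token", "embeddings"]),
    ("ner_chunk", ["sentence", "token", "ner"]),
]


def missing_inputs(wiring, source_columns=("text",)):
    """Return (stage output, missing input) pairs for inputs not yet produced."""
    available = set(source_columns)
    problems = []
    for output_col, input_cols in wiring:
        for col in input_cols:
            if col not in available:
                problems.append((output_col, col))
        available.add(output_col)
    return problems


def build_ner_pipeline():
    """Assemble an (unfitted) NER pipeline that follows WIRING."""
    # Imported lazily so the wiring check above runs without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        NerConverter, NerDLModel, SentenceDetector, Tokenizer, WordEmbeddingsModel,
    )

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    embeddings = (WordEmbeddingsModel.pretrained("glove_100d", "en")
                  .setInputCols(["sentence", "token"]).setOutputCol("embeddings"))
    ner = (NerDLModel.pretrained("ner_dl", "en")
           .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner"))
    chunks = (NerConverter()
              .setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk"))
    return Pipeline(stages=[document, sentence, token, embeddings, ner, chunks])
```

Swapping the embeddings stage for a different pretrained model would typically mean choosing an NER model trained on those same embeddings, which is one reason the choice of embeddings gets its own topic.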

Classification and Inference

  • Document classification use cases
  • Sentiment analysis models
  • Training a document classifier
  • Using other machine learning frameworks
  • Managing NLP models
  • Optimizing models for low-latency inference
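
The training and low-latency topics above could be sketched as follows. This is a hedged example, not a definitive recipe: it assumes a Spark DataFrame with `text` and `label` columns, the default pretrained Universal Sentence Encoder, and an arbitrary epoch count. The function names `train_classifier`, `to_light_pipeline`, and `predicted_label` are hypothetical. `LightPipeline` is Spark NLP's wrapper for annotating small inputs on the driver, which is one common way to reduce latency for single requests.

```python
def train_classifier(training_df, max_epochs=5):
    """Fit a document classifier on a DataFrame with 'text' and 'label' columns."""
    # Imported lazily so the pure-Python helper below works without Spark.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import ClassifierDLApproach, UniversalSentenceEncoder

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    encoder = (UniversalSentenceEncoder.pretrained()
               .setInputCols(["document"]).setOutputCol("sentence_embeddings"))
    classifier = (ClassifierDLApproach()
                  .setInputCols(["sentence_embeddings"])
                  .setOutputCol("class")
                  .setLabelColumn("label")
                  .setMaxEpochs(max_epochs))  # epoch count is an assumption
    return Pipeline(stages=[document, encoder, classifier]).fit(training_df)


def to_light_pipeline(pipeline_model):
    """Wrap a fitted PipelineModel for fast annotation of individual strings."""
    from sparknlp.base import LightPipeline
    return LightPipeline(pipeline_model)


def predicted_label(annotations, key="class"):
    """Return the first predicted label from an annotate() result, or None."""
    labels = annotations.get(key, [])
    return labels[0] if labels else None
```

In use, `predicted_label(to_light_pipeline(model).annotate("Great service"))` would return the predicted class for one string, where `model` is the result of `train_classifier`.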

Troubleshooting

Summary and Next Steps