Duration
7 hours (usually 1 day including breaks)
Requirements
- Experience with Python and Apache Kafka
- Familiarity with stream-processing platforms
Audience
- Data engineers
- Data scientists
- Programmers
Overview
Apache Spark Streaming is a scalable, open-source stream-processing engine that enables fault-tolerant processing of real-time data streams from supported sources such as Apache Kafka.
This instructor-led, live training (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams and deliver the results to databases, file systems, and live dashboards.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
Overview of Spark Streaming Features and Architecture
- Supported data sources
- Core APIs
Preparing the Environment
- Dependencies
- Spark and streaming context
- Connecting to Kafka (see the sketch after this list)
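A minimal environment sketch, assuming Spark 3.x with the spark-sql-kafka-0-10 connector available (for example added at launch via spark-submit --packages, with coordinates matching the Spark and Scala versions in use), a local Kafka broker on localhost:9092, and a hypothetical topic name "events". The course may work with either the classic streaming context (DStream) API or Structured Streaming; this sketch uses Structured Streaming, which provides the Kafka source for Python in recent Spark releases.

```python
# Minimal sketch: create a SparkSession and subscribe to a Kafka topic.
# Assumptions: Spark 3.x, spark-sql-kafka-0-10 connector on the classpath,
# broker at localhost:9092, hypothetical topic "events".
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-streaming-course")
    .getOrCreate()
)

# Read raw Kafka records as a streaming DataFrame.
raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)
```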
Processing Messages
- Parsing inbound messages as JSON
- ETL processes
- Starting the streaming context (see the sketch after this list)
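A minimal processing sketch that continues from the raw_stream above. The JSON schema and field names (device_id, temperature, event_time) are hypothetical, the console sink stands in for a real target, and starting the query is the Structured Streaming counterpart of starting a streaming context.

```python
# Minimal sketch: parse Kafka message values as JSON and apply a simple ETL step.
# The schema and field names are hypothetical.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (
    raw_stream
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("event"))
    .select("event.*")
    # Example ETL step: drop malformed rows and normalize a field.
    .dropna(subset=["device_id"])
    .withColumn("device_id", F.upper(F.col("device_id")))
)

# Start the query (the Structured Streaming analogue of starting a StreamingContext).
query = (
    events.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
```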
Performing Windowed Stream Processing
- Slide interval
- Checkpoint delivery configuration
- Launching the environment (see the windowing sketch after this list)
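A minimal windowed-aggregation sketch continuing from the events DataFrame above. The window length, slide interval, watermark, and checkpoint path are hypothetical values chosen only to illustrate the configuration.

```python
# Minimal sketch: sliding-window aggregation with a checkpoint directory.
# Window length, slide interval, watermark, and paths are hypothetical.
from pyspark.sql import functions as F

windowed_counts = (
    events
    .withWatermark("event_time", "2 minutes")
    .groupBy(
        F.window(F.col("event_time"), "1 minute", "30 seconds"),  # window length, slide interval
        F.col("device_id"),
    )
    .count()
)

windowed_query = (
    windowed_counts.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/spark-checkpoints/windowed")
    .start()
)
```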
Prototyping the Processing Code
- Connecting to a Kafka topic
- Retrieving JSON from a data source using Paw (see the producer sketch after this list)
- Variations and additional processing
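A minimal prototyping sketch for feeding test data into the topic, assuming the kafka-python package; the payload fields mirror the hypothetical schema above. In practice, an HTTP client such as Paw can supply the sample JSON instead of hard-coding it.

```python
# Minimal sketch: publish sample JSON messages to the prototyping topic.
# Assumptions: kafka-python installed, broker at localhost:9092,
# hypothetical topic "events" and payload fields.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

for i in range(10):
    producer.send("events", {
        "device_id": f"sensor-{i % 3}",
        "temperature": 20.0 + i,
        "event_time": time.strftime("%Y-%m-%dT%H:%M:%S"),
    })

producer.flush()
```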
Streaming the Code
- Job control variables
- Defining values to match
- Functions and conditions (see the sketch after this list)
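A minimal sketch of job-control variables and a matching condition, continuing from the events DataFrame above; the threshold, device list, and field names are hypothetical.

```python
# Minimal sketch: job-control variables and a condition that splits the stream
# into matched and non-matched records. Values and field names are hypothetical.
from pyspark.sql import functions as F

# Job-control variables gathered in one place so they are easy to tune.
TEMPERATURE_THRESHOLD = 25.0
DEVICES_TO_MATCH = ["SENSOR-0", "SENSOR-1"]

match_condition = (
    F.col("device_id").isin(DEVICES_TO_MATCH)
    & (F.col("temperature") > TEMPERATURE_THRESHOLD)
)

matched = events.filter(match_condition)
non_matched = events.filter(~match_condition)
```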
Acquiring Stream Output
- Counters
- Kafka output (matched and non-matched; see the sketch after this list)
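A minimal output sketch continuing from the matched and non_matched DataFrames above. The output topic names and checkpoint paths are hypothetical; a Kafka sink expects a value column plus a checkpoint location, and the console counter stands in for a real metrics target.

```python
# Minimal sketch: write matched and non-matched records to separate Kafka topics
# and keep a simple running counter on the console. Topic names and paths are
# hypothetical.
from pyspark.sql import functions as F

def to_kafka(df, topic, checkpoint_dir):
    # Serialize each row as JSON into the "value" column expected by the Kafka sink.
    return (
        df.select(F.to_json(F.struct(*df.columns)).alias("value"))
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", topic)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )

matched_query = to_kafka(matched, "events-matched", "/tmp/spark-checkpoints/matched")
non_matched_query = to_kafka(non_matched, "events-nonmatched", "/tmp/spark-checkpoints/nonmatched")

# A running counter of matched records, printed to the console.
counter_query = (
    matched.groupBy().count()
    .writeStream
    .outputMode("complete")
    .format("console")
    .start()
)

spark.streams.awaitAnyTermination()
```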
Troubleshooting
Summary and Conclusion