Duration
21 hours (usually 3 days including breaks)
Requirements
- Experience with the Linux command line
- A general understanding of data processing
- Programming experience with Java, Scala, Python, or R
Audience
- Developers
Overview
Apache Spark is an analytics engine designed to distribute data across a cluster and process it in parallel. It includes modules for streaming, SQL, machine learning, and graph processing.
This instructor-led, live training (online or onsite) is aimed at engineers who wish to deploy an Apache Spark system for processing very large amounts of data.
By the end of this training, participants will be able to:
- Install and configure Apache Spark.
- Understand the difference between Apache Spark and Hadoop MapReduce and when to use which.
- Quickly read in and analyze very large data sets.
- Integrate Apache Spark with other machine learning tools.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
- Apache Spark vs Hadoop MapReduce
Overview of Apache Spark Features and Architecture
Choosing a Programming Language
Setting up Apache Spark
Creating a Sample Application
Choosing the Data Set
Running Data Analysis on the Data
Processing of Structured Data with Spark SQL
Processing Streaming Data with Spark Streaming
Integrating Apache Spark with Third-Party Machine Learning Tools
Using Apache Spark for Graph Processing
Optimizing Apache Spark
Troubleshooting
Summary and Conclusion