Apache Spark Fundamentals Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • Experience with the Linux command line
  • A general understanding of data processing
  • Programming experience with Java, Scala, Python, or R

Audience

  • Developers

Overview

Apache Spark is an analytics engine designed to distribute data across a cluster in order to process it in parallel. It contains modules for streaming, SQL, machine learning and graph processing.

This instructor-led, live training (online or onsite) is aimed at engineers who wish to deploy an Apache Spark system for processing very large amounts of data.

By the end of this training, participants will be able to:

  • Install and configure Apache Spark.
  • Understand the difference between Apache Spark and Hadoop MapReduce and when to use which.
  • Quickly read in and analyze very large data sets.
  • Integrate Apache Spark with other machine learning tools.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange it.

Course Outline

Introduction

  • Apache Spark vs Hadoop MapReduce

Overview of Apache Spark Features and Architecture

Choosing a Programming Language

Setting up Apache Spark
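
For orientation in this module, a minimal sketch of starting a local Spark session in PySpark, assuming Spark has been installed locally (for example via pip install pyspark); the master URL and memory setting are illustrative, not prescriptive.

    from pyspark.sql import SparkSession

    # Start (or reuse) a local SparkSession; "local[*]" uses all available cores.
    spark = (
        SparkSession.builder
        .appName("spark-fundamentals")
        .master("local[*]")
        .config("spark.driver.memory", "4g")  # illustrative setting
        .getOrCreate()
    )

    print(spark.version)  # confirm the session is up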

Creating a Sample Application
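
As a sketch of what a first sample application might look like, here is a word count over a text file in PySpark; the input path is hypothetical and the session from the previous module is assumed.

    from pyspark.sql.functions import col, explode, lower, split

    # Hypothetical input path; replace with the data set chosen for the course.
    lines = spark.read.text("data/sample.txt")

    counts = (
        lines
        .select(explode(split(lower(col("value")), r"\s+")).alias("word"))
        .where(col("word") != "")
        .groupBy("word")
        .count()
        .orderBy(col("count").desc())
    )

    counts.show(10)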

Choosing the Data Set

Running Data Analysis on the Data
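
A sketch of the kind of exploratory analysis this module covers, assuming a CSV data set with hypothetical columns (country and amount) chosen purely for illustration.

    from pyspark.sql import functions as F

    # Hypothetical CSV file and columns, for illustration only.
    df = spark.read.csv("data/transactions.csv", header=True, inferSchema=True)

    df.printSchema()
    print(df.count())

    # Aggregate: total and average amount per country.
    summary = (
        df.groupBy("country")
          .agg(F.sum("amount").alias("total"),
               F.avg("amount").alias("average"))
          .orderBy(F.col("total").desc())
    )
    summary.show()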

Processing Structured Data with Spark SQL
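
For this module, a small sketch of querying a DataFrame through Spark SQL by registering it as a temporary view; the table and column names continue the hypothetical example above.

    # Expose the DataFrame to the SQL engine as a temporary view.
    df.createOrReplaceTempView("transactions")

    top_countries = spark.sql("""
        SELECT country, SUM(amount) AS total
        FROM transactions
        GROUP BY country
        ORDER BY total DESC
        LIMIT 5
    """)
    top_countries.show()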

Processing Streaming Data with Spark Streaming
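
One way to illustrate the streaming module is with Structured Streaming (the DataFrame-based API), reading lines from a local socket; the host and port below are placeholders.

    from pyspark.sql.functions import explode, split

    # Read a stream of lines from a socket (placeholder host/port).
    lines = (
        spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load()
    )

    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Print the running counts to the console; stop with query.stop().
    query = (
        counts.writeStream
              .outputMode("complete")
              .format("console")
              .start()
    )
    query.awaitTermination()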

Integrating Apache Spark with 3rd-Party Machine Learning Tools
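
One possible integration, sketched here, is handing a sampled Spark DataFrame to scikit-learn via pandas; scikit-learn and the feature and label column names are assumptions for illustration, and the earlier hypothetical DataFrame is reused.

    from sklearn.linear_model import LogisticRegression

    # Pull a manageable sample out of Spark into pandas for a single-machine library.
    sample_pd = df.select("feature_a", "feature_b", "label").sample(fraction=0.1).toPandas()

    model = LogisticRegression()
    model.fit(sample_pd[["feature_a", "feature_b"]], sample_pd["label"])
    print(model.score(sample_pd[["feature_a", "feature_b"]], sample_pd["label"]))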

Using Apache Spark for Graph Processing
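
A sketch for the graph module, assuming the optional GraphFrames package is available (it is not bundled with Spark itself); the vertices and edges are toy data.

    from graphframes import GraphFrame

    # Toy graph: vertices need an "id" column, edges need "src" and "dst".
    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)
    g.inDegrees.show()

    # PageRank as a simple whole-graph algorithm.
    results = g.pageRank(resetProbability=0.15, maxIter=10)
    results.vertices.select("id", "pagerank").show()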

Optimizing Apache Spark
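
A few of the optimization levers this module touches on, sketched in PySpark; the partition count and the decision to cache are workload-dependent examples rather than recommendations, and the hypothetical DataFrame from earlier is reused.

    # Cache a DataFrame that is reused across several actions.
    df.cache()
    df.count()  # materializes the cache

    # Control parallelism of downstream stages (value is illustrative).
    repartitioned = df.repartition(200, "country")

    # Inspect the physical plan to see shuffles, joins, and pushed-down filters.
    repartitioned.groupBy("country").count().explain(True)

    # Tune shuffle partitions for the session (default is 200).
    spark.conf.set("spark.sql.shuffle.partitions", "64")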

Troubleshooting

Summary and Conclusion
