Apache Spark in the Cloud Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

Programming skills (preferably Python or Scala)

SQL basics

Overview

Apache Spark's learning curve rises slowly at the beginning: it takes a lot of effort to get the first return. This course aims to jump over that first tough part. After taking this course, participants will understand the basics of Apache Spark, clearly differentiate an RDD from a DataFrame, learn the Python and Scala APIs, and understand executors and tasks. Following best practices, the course also focuses strongly on cloud deployment with Databricks and AWS. Participants will also learn the differences between AWS EMR and AWS Glue, one of the latest Spark services from AWS.

AUDIENCE:

Data Engineers, DevOps engineers, Data Scientists

Course Outline

Introduction:

  • Apache Spark in the Hadoop Ecosystem
  • Short introduction to Python and Scala

Basics (theory):

  • Architecture
  • RDD
  • Transformations and Actions (see the sketch after this list)
  • Stages, Tasks, Dependencies
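
As a first taste of these concepts, here is a minimal PySpark sketch (the data and numbers are invented for illustration) showing that transformations only build the RDD lineage, while an action triggers the actual job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-evaluation-demo").getOrCreate()
sc = spark.sparkContext

# Transformations (filter, map) are lazy: they only record lineage,
# no job is launched yet.
numbers = sc.parallelize(range(1, 1001))
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# An action (reduce) triggers the computation: Spark splits the lineage
# into stages and schedules tasks on the executors.
total = squares.reduce(lambda a, b: a + b)
print(total)
```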

Understanding the basics in the Databricks environment (hands-on workshop):

  • Exercises using the RDD API (see the first sketch after this list)
  • Basic action and transformation functions
  • PairRDD
  • Join
  • Caching strategies
  • Exercises using the DataFrame API (see the second sketch after this list)
  • SparkSQL
  • DataFrame: select, filter, group, sort
  • UDF (User Defined Function)
  • A look at the Dataset API
  • Streaming (see the third sketch after this list)
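
The RDD-level exercises cover ground like the following sketch, which touches PairRDDs, join, and caching; the sample data and key names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-exercises").getOrCreate()
sc = spark.sparkContext

# PairRDDs are RDDs of (key, value) tuples; they unlock the *ByKey operations.
orders = sc.parallelize([("alice", 30), ("bob", 20), ("alice", 50)])
cities = sc.parallelize([("alice", "Berlin"), ("bob", "Warsaw")])

# reduceByKey aggregates values per key before the shuffle.
totals = orders.reduceByKey(lambda a, b: a + b)

# join pairs up the two PairRDDs on their keys.
joined = totals.join(cities)  # -> (key, (total, city))

# cache() keeps the joined RDD in executor memory, since we
# reuse it in two separate actions below.
joined.cache()
print(joined.count())
print(joined.collect())
```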
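The DataFrame-level exercises look roughly like this second sketch, showing select/filter/group/sort, the same query through SparkSQL, and a UDF; the table and column names are invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("dataframe-exercises").getOrCreate()

df = spark.createDataFrame(
    [("alice", "Berlin", 30), ("bob", "Warsaw", 20), ("carol", "Berlin", 50)],
    ["name", "city", "amount"],
)

# select / filter / group / sort with the DataFrame API
(df.filter(F.col("amount") > 25)
   .groupBy("city")
   .agg(F.sum("amount").alias("total"))
   .orderBy(F.desc("total"))
   .show())

# The same query through SparkSQL
df.createOrReplaceTempView("sales")
spark.sql("""
    SELECT city, SUM(amount) AS total
    FROM sales
    WHERE amount > 25
    GROUP BY city
    ORDER BY total DESC
""").show()

# A simple UDF; Python UDFs are opaque to the optimizer,
# so prefer built-in functions when one exists.
shout = F.udf(lambda s: s.upper(), StringType())
df.select(shout(F.col("name")).alias("name_upper")).show()
```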
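For streaming, a third minimal sketch using Structured Streaming with the built-in rate source, so it runs without any external infrastructure:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows; handy for demos.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Windowed aggregation over event time.
counts = stream.groupBy(F.window(F.col("timestamp"), "10 seconds")).count()

# Print each updated result table to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination(30)  # run for ~30 seconds, then stop
query.stop()
```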

Understanding deployment in the AWS environment (hands-on workshop):

  • Basics of AWS Glue
  • Understand the differences between AWS EMR and AWS Glue
  • Example jobs in both environments (a minimal Glue job skeleton follows this list)
  • Understand the pros and cons of each
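
For orientation, this is the standard boilerplate of a Glue PySpark job script; the S3 paths and bucket name are placeholders:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Glue passes job parameters on the command line; JOB_NAME is always set.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Plain Spark DataFrame code works inside a Glue job;
# the s3:// paths below are placeholders.
df = spark.read.json("s3://example-bucket/input/")
df.filter(df["amount"] > 0).write.parquet("s3://example-bucket/output/")

job.commit()
```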

Extra:

  • Introduction to orchestration with Apache Airflow (a minimal DAG sketch follows)
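
As a teaser, a minimal Airflow DAG that submits a Spark job once a day; this assumes Airflow 2.x, and the script path and master URL are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# One DAG with a single task that shells out to spark-submit.
with DAG(
    dag_id="daily_spark_job",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit = BashOperator(
        task_id="spark_submit",
        bash_command=(
            "spark-submit --master yarn "
            "/opt/jobs/daily_aggregation.py"  # placeholder script path
        ),
    )
```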
