Hadoop and Spark for Administrators Training Course

Duration

35 hours (usually 5 days including breaks)

Requirements

  • System administration experience
  • Experience with Linux command line
  • An understanding of big data concepts

Audience

  • System administrators
  • DBAs

Overview

Apache Hadoop is a popular framework for processing large data sets across clusters of computers.

This instructor-led, live training (online or onsite) is aimed at system administrators who wish to learn how to set up, deploy and manage Hadoop clusters within their organization.

By the end of this training, participants will be able to:

  • Install and configure Apache Hadoop.
  • Understand the four major components in the Hadoop ecosystem: HDFS, MapReduce, YARN, and Hadoop Common.
  • Use Hadoop Distributed File System (HDFS) to scale a cluster to hundreds or thousands of nodes.  
  • Set up HDFS to operate as the storage engine for on-premises Spark deployments.
  • Set up Spark to access alternative storage solutions such as Amazon S3 and NoSQL databases such as Redis, Elasticsearch, Couchbase, and Aerospike (see the sketch after this list).
  • Carry out administrative tasks such as provisioning, management, monitoring and securing an Apache Hadoop cluster.
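
The following is a minimal PySpark sketch of the S3 setup mentioned above. It assumes the hadoop-aws (s3a) connector is on the classpath; the bucket name and credential placeholders are illustrative, not part of the course materials.

    # Point Spark at Amazon S3 via the s3a filesystem connector.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-storage-demo")
             .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
             .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder
             .getOrCreate())

    # Read and write directly against an S3 bucket using s3a:// paths.
    df = spark.read.json("s3a://example-bucket/events/")
    df.write.parquet("s3a://example-bucket/events-parquet/")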

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

  • Introduction to Cloud Computing and Big Data solutions
  • Overview of Apache Hadoop Features and Architecture

Setting up Hadoop

  • Planning a Hadoop cluster (on-premise, cloud, etc.)
  • Selecting the OS and Hadoop distribution
  • Provisioning resources (hardware, network, etc.)
  • Downloading and installing the software
  • Sizing the cluster for flexibility

Working with HDFS

  • Understanding the Hadoop Distributed File System (HDFS)
  • Overview of HDFS Command Reference
  • Accessing HDFS
  • Performing Basic File Operations on HDFS (see the sketch below)
  • Using S3 as a complement to HDFS
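
As a preview of the basic file operations above, here is a minimal sketch that drives the standard hdfs dfs command-line client from Python. It assumes a configured Hadoop client on the PATH; the paths are placeholders.

    import subprocess

    def hdfs(*args):
        """Run an 'hdfs dfs' subcommand, raising on failure."""
        subprocess.run(["hdfs", "dfs", *args], check=True)

    hdfs("-mkdir", "-p", "/user/demo")         # create a directory
    hdfs("-put", "report.csv", "/user/demo/")  # upload a local file
    hdfs("-ls", "/user/demo")                  # list directory contents
    hdfs("-cat", "/user/demo/report.csv")      # print a file
    hdfs("-rm", "-r", "/user/demo")            # remove recursively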

Overview of MapReduce

  • Understanding Data Flow in the MapReduce Framework
  • Map, Shuffle, Sort and Reduce
  • Demo: Computing Top Salaries (sketched below)
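
To make the data flow concrete, here is a toy, single-process Python illustration of the map, shuffle/sort, and reduce phases using the top-salaries demo; the records are made up, and a real job would run the same logic on a Hadoop cluster.

    from itertools import groupby
    from operator import itemgetter

    # Map phase: emit (key, value) pairs -- here (department, salary).
    records = [("sales", 70000), ("eng", 95000), ("sales", 82000), ("eng", 88000)]
    mapped = [(dept, salary) for dept, salary in records]

    # Shuffle/sort phase: bring all values for a key together.
    mapped.sort(key=itemgetter(0))
    grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}

    # Reduce phase: compute the top salary per department.
    top = {dept: max(salaries) for dept, salaries in grouped.items()}
    print(top)  # {'eng': 95000, 'sales': 82000}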

Working with YARN

  • Understanding resource management in Hadoop
  • Working with the ResourceManager, NodeManager, and ApplicationMaster
  • Scheduling jobs under YARN
  • Scheduling for large numbers of nodes and clusters
  • Demo: Job scheduling

Integrating Hadoop with Spark

  • Setting up storage for Spark (HDFS, Amazon S3, NoSQL, etc.)
  • Understanding Resilient Distributed Datasets (RDDs)
  • Creating an RDD
  • Implementing RDD Transformations
  • Demo: Implementing a Text Search Program for Movie Titles (sketched below)
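
A minimal PySpark sketch of the text-search demo: load titles into an RDD, apply a filter transformation, and trigger the job with an action. The input path and search term are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("title-search").getOrCreate()
    sc = spark.sparkContext

    titles = sc.textFile("hdfs:///data/movie_titles.txt")    # one title per line
    matches = titles.filter(lambda t: "star" in t.lower())   # lazy transformation
    print(matches.take(10))                                  # action runs the job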

Managing a Hadoop Cluster

  • Monitoring Hadoop
  • Securing a Hadoop cluster
  • Adding and removing nodes
  • Running a performance benchmark
  • Tuning a Hadoop cluster to optimize performance
  • Backup, recovery and business continuity planning
  • Ensuring high availability (HA)

Upgrading and Migrating a Hadoop Cluster

  • Assessing workload requirements
  • Upgrading Hadoop
  • Moving from on-premises to cloud and vice versa
  • Recovering from failures

Troubleshooting

Summary and Conclusion

Magellan: Geospatial Analytics on Spark Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • Experience with Apache Spark
  • Experience with SQL and data frame queries

Audience

  • Application developers

Overview

Magellan is an open-source distributed execution engine for geospatial analytics on big data. Implemented on top of Apache Spark, it extends Spark SQL and provides a relational abstraction for geospatial analytics.

This instructor-led, live training introduces the concepts and approaches for implementing geospatial analytics and walks participants through the creation of a predictive analysis application using Magellan on Spark.

By the end of this training, participants will be able to:

  • Efficiently query, parse and join geospatial datasets at scale
  • Implement geospatial data in business intelligence and predictive analytics applications
  • Use spatial context to extend the capabilities of mobile devices, sensors, logs, and wearables

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

To request a customized course outline for this training, please contact us to arrange.

Spark for Developers Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • Familiarity with Java, Scala, or Python (the labs are in Scala and Python)
  • A basic understanding of a Linux development environment (command-line navigation, editing files with vi or nano)

Overview

OBJECTIVE:

This course introduces Apache Spark. Participants will learn how Spark fits into the Big Data ecosystem and how to use it for data analysis. The course covers the Spark shell for interactive data analysis, Spark internals, the Spark APIs, Spark SQL, Spark Streaming, MLlib, and GraphX.

AUDIENCE:

Developers / Data Analysts

Course Outline

  1. Scala primer
    • A quick introduction to Scala
    • Labs: Getting to know Scala
  2. Spark Basics
    • Background and history
    • Spark and Hadoop
    • Spark concepts and architecture
    • Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
    • Labs: Installing and running Spark
  3. First Look at Spark
    • Running Spark in local mode
    • Spark web UI
    • Spark shell
    • Analyzing a dataset – part 1
    • Inspecting RDDs
    • Labs: Spark shell exploration
  4. RDDs
    • RDDs concepts
    • Partitions
    • RDD Operations / transformations
    • RDD types
    • Key-Value pair RDDs
    • MapReduce on RDD
    • Caching and persistence
    • Labs: Creating and inspecting RDDs; caching RDDs
  5. Spark API programming
    • Introduction to Spark API / RDD API
    • Submitting the first program to Spark
    • Debugging / logging
    • Configuration properties
    • Labs: Programming with the Spark API; submitting jobs
  6. Spark SQL
    • SQL support in Spark
    • Dataframes
    • Defining tables and importing datasets
    • Querying data frames using SQL
    • Storage formats: JSON / Parquet
    • Labs: Creating and querying data frames; evaluating data formats (see the sketch below)
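
A minimal sketch of the lab above, assuming a running SparkSession; the table and data are made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

    df.createOrReplaceTempView("people")          # expose the DataFrame as a SQL table
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    df.write.mode("overwrite").parquet("/tmp/people.parquet")  # columnar storage format
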
  7. MLlib
    • MLlib intro
    • MLlib algorithms
    • Labs: Writing MLlib applications
  8. GraphX
    • GraphX library overview
    • GraphX APIs
    • Labs: Processing graph data using Spark
  9. Spark Streaming
    • Streaming overview
    • Evaluating Streaming platforms
    • Streaming operations
    • Sliding window operations
    • Labs: Writing Spark Streaming applications (see the sketch below)
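
A minimal sketch of a streaming word count using Spark Structured Streaming, reading lines from a local socket (e.g. fed by "nc -lk 9999") and printing running counts to the console.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()
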
  10. Spark and Hadoop
    • Hadoop Intro (HDFS / YARN)
    • Hadoop + Spark architecture
    • Running Spark on Hadoop YARN
    • Processing HDFS files using Spark (see the sketch below)
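
A minimal sketch of pointing a Spark application at YARN; in practice the master is usually supplied by spark-submit (e.g. spark-submit --master yarn app.py), and the HDFS path is a placeholder.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("spark-on-yarn-demo")
             .master("yarn")                    # requires HADOOP_CONF_DIR to be set
             .getOrCreate())

    df = spark.read.text("hdfs:///data/logs/")  # read files stored in HDFS
    print(df.count())
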
  11. Spark Performance and Tuning
    • Broadcast variables
    • Accumulators (see the sketch after this list)
    • Memory management & caching
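
A minimal sketch of the two shared-variable types above: a broadcast variable (a read-only lookup shipped once to each executor) and an accumulator (a write-only counter aggregated back on the driver). The lookup data is made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tuning-demo").getOrCreate()
    sc = spark.sparkContext

    lookup = sc.broadcast({"US": "United States", "DE": "Germany"})
    bad_codes = sc.accumulator(0)

    def expand(code):
        if code not in lookup.value:
            bad_codes.add(1)       # counted across all tasks
            return "unknown"
        return lookup.value[code]

    result = sc.parallelize(["US", "DE", "XX"]).map(expand).collect()
    print(result, "unmatched:", bad_codes.value)
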
  12. Spark Operations
    • Deploying Spark in production
    • Sample deployment templates
    • Configurations
    • Monitoring
    • Troubleshooting

Python and Spark for Big Data (PySpark) Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • General programming skills

Audience

  • Developers
  • IT Professionals
  • Data Scientists

Overview

Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python.

In this instructor-led, live training, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.

By the end of this training, participants will be able to:

  • Use Spark with Python to analyze big data.
  • Work on exercises that mimic real-world cases.
  • Apply different tools and techniques for big data analysis using PySpark.

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

Introduction

Understanding Big Data

Overview of Spark

Overview of Python

Overview of PySpark

  • Distributing Data Using Resilient Distributed Datasets Framework
  • Distributing Computation Using Spark API Operators (see the sketch below)
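
A minimal sketch of both ideas, assuming a local PySpark installation: parallelize() distributes the data across partitions, and map()/reduce() distribute the computation.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyspark-overview").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize(range(1, 1001), numSlices=8)            # an RDD in 8 partitions
    total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)  # runs in parallel
    print(total)  # sum of squares from 1 to 1000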

Setting Up Python with Spark

Setting Up PySpark

Using Amazon Web Services (AWS) EC2 Instances for Spark

Setting Up Databricks

Setting Up the AWS EMR Cluster

Learning the Basics of Python Programming

  • Getting Started with Python
  • Using the Jupyter Notebook
  • Using Variables and Simple Data Types
  • Working with Lists
  • Using if Statements
  • Using User Inputs
  • Working with while Loops
  • Implementing Functions
  • Working with Classes
  • Working with Files and Exceptions
  • Working with Projects, Data, and APIs (a compact example of these basics follows)
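
A compact example touching several of the basics listed above (variables, lists, if statements, while loops, functions, and classes):

    class Counter:
        def __init__(self):
            self.count = 0

        def bump(self):
            self.count += 1

    def evens(values):
        return [v for v in values if v % 2 == 0]

    numbers = [1, 2, 3, 4, 5]
    counter = Counter()
    i = 0
    while i < len(numbers):
        if numbers[i] % 2 == 0:
            counter.bump()
        i += 1

    print(evens(numbers), counter.count)  # [2, 4] 2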

Learning the Basics of Spark DataFrame

  • Getting Started with Spark DataFrames
  • Implementing Basic Operations with Spark
  • Using Groupby and Aggregate Operations
  • Working with Timestamps and Dates

Working on a Spark DataFrame Project Exercise

Understanding Machine Learning with MLlib

Working with MLlib, Spark, and Python for Machine Learning

Understanding Regressions

  • Learning Linear Regression Theory
  • Implementing a Regression Evaluation Code
  • Working on a Sample Linear Regression Exercise (see the sketch after this list)
  • Learning Logistic Regression Theory
  • Implementing a Logistic Regression Code
  • Working on a Sample Logistic Regression Exercise
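
A minimal sketch of the linear regression workflow, using the DataFrame-based pyspark.ml API; the feature columns and data are made up.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("regression-demo").getOrCreate()

    df = spark.createDataFrame(
        [(1.0, 2.0, 5.1), (2.0, 1.0, 6.9), (3.0, 4.0, 13.2)],
        ["x1", "x2", "label"])

    # Assemble raw columns into the single vector column MLlib expects.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    model = LinearRegression(featuresCol="features", labelCol="label") \
        .fit(assembler.transform(df))

    print(model.coefficients, model.intercept)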

Understanding Random Forests and Decision Trees

  • Learning Tree Methods Theory
  • Implementing Decision Trees and Random Forest Codes
  • Working on a Sample Random Forest Classification Exercise

Working with K-means Clustering

  • Understanding K-means Clustering Theory
  • Implementing a K-means Clustering Code (see the sketch below)
  • Working on a Sample Clustering Exercise
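
A minimal sketch of K-means with pyspark.ml; the points and the choice of k=2 are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

    df = spark.createDataFrame(
        [(0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)], ["x", "y"])
    assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")

    model = KMeans(k=2, seed=42).fit(assembler.transform(df))
    print(model.clusterCenters())  # two centers, one per cluster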

Working with Recommender Systems
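
As a taste of this module, here is a minimal sketch of a collaborative-filtering recommender built with Spark's ALS implementation; the rating triples are made up.

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("recommender-demo").getOrCreate()

    ratings = spark.createDataFrame(
        [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 3.0)],
        ["userId", "movieId", "rating"])

    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              rank=5, maxIter=5, coldStartStrategy="drop")
    model = als.fit(ratings)

    model.recommendForAllUsers(2).show(truncate=False)  # top 2 movies per user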

Implementing Natural Language Processing

  • Understanding Natural Language Processing (NLP)
  • Overview of NLP Tools
  • Working on a Sample NLP Exercise (see the sketch below)
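
A minimal sketch of the kind of NLP preprocessing covered here: tokenizing text and hashing tokens into term-frequency feature vectors. The documents are made up.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, HashingTF

    spark = SparkSession.builder.appName("nlp-demo").getOrCreate()

    docs = spark.createDataFrame(
        [(0, "spark makes big data simple"), (1, "python and spark for nlp")],
        ["id", "text"])

    tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
    tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1024)
    tf.transform(tokens).select("id", "features").show(truncate=False)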

Streaming with Spark on Python

  • Overview of Streaming with Spark
  • Sample Spark Streaming Exercise

Closing Remarks

Python, Spark, and Hadoop for Big Data Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • Experience with Spark and Hadoop
  • Python programming experience

Audience

  • Data scientists
  • Developers

Overview

Python is a scalable, flexible, and widely used programming language for data science and machine learning. Spark is a data processing engine used in querying, analyzing, and transforming big data, while Hadoop is a software library framework for large-scale data storage and processing.

This instructor-led, live training (online or onsite) is aimed at developers who wish to use and integrate Spark, Hadoop, and Python to process, analyze, and transform large and complex data sets.

By the end of this training, participants will be able to:

  • Set up the necessary environment to start processing big data with Spark, Hadoop, and Python.
  • Understand the features, core components, and architecture of Spark and Hadoop.
  • Integrate Spark, Hadoop, and Python for big data processing.
  • Explore the tools in the Spark ecosystem (Spark MLlib, Spark Streaming, Kafka, Sqoop, and Flume).
  • Build collaborative filtering recommendation systems similar to Netflix, YouTube, Amazon, Spotify, and Google.
  • Use Apache Mahout to scale machine learning algorithms.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

  • Overview of Spark and Hadoop features and architecture
  • Understanding big data
  • Python programming basics

Getting Started

  • Setting up Python, Spark, and Hadoop
  • Understanding data structures in Python
  • Understanding PySpark API
  • Understanding HDFS and MapReduce

Integrating Spark and Hadoop with Python

  • Implementing Spark RDD in Python
  • Processing data using MapReduce
  • Creating distributed datasets in HDFS (see the sketch below)
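
A minimal sketch tying these three bullets together: an RDD created from a file in HDFS, processed MapReduce-style with map and reduceByKey, and written back to HDFS. The paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-wordcount").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs:///data/input.txt")        # distributed dataset in HDFS
    counts = (lines.flatMap(lambda line: line.split())   # map: split lines into words
              .map(lambda word: (word, 1))               # map: emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))          # reduce: sum counts per word
    counts.saveAsTextFile("hdfs:///data/output")         # write results back to HDFS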

Machine Learning with Spark MLlib

Processing Big Data with Spark Streaming

Working with Recommender Systems

Working with Kafka, Sqoop, and Flume
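
As a preview of the Kafka portion, here is a minimal sketch of consuming a Kafka topic with Spark Structured Streaming. It assumes the spark-sql-kafka connector package is available; the broker address and topic name are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())

    # Kafka delivers binary key/value columns; cast the payload to a string.
    messages = stream.selectExpr("CAST(value AS STRING) AS payload")
    query = messages.writeStream.format("console").start()
    query.awaitTermination()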

Apache Mahout with Spark and Hadoop

Troubleshooting

Summary and Next Steps