Hadoop and Spark for Administrators Training Course – Bluechip AI Asia, AI Development Company

Duration

35 hours (usually 5 days including breaks)

Requirements

System administration experience
Experience with Linux command line
An understanding of big data concepts

Audience

System administrators
DBAs

Overview

Apache Hadoop is a popular framework for storing and processing large data sets across clusters of computers.

This instructor-led, live training (online or onsite) is aimed at system administrators who wish to learn how to set up, deploy and manage Hadoop clusters within their organization.

By the end of this training, participants will be able to:

Install and configure Apache Hadoop.
Understand the four major components in the Hadoop ecosystem: HDFS, MapReduce, YARN, and Hadoop Common.
Use Hadoop Distributed File System (HDFS) to scale a cluster to hundreds or thousands of nodes.
Set up HDFS to operate as the storage engine for on-premises Spark deployments.
Set up Spark to access alternative storage solutions such as Amazon S3 and NoSQL database systems such as Redis, Elasticsearch, Couchbase, and Aerospike.
Carry out administrative tasks such as provisioning, managing, monitoring, and securing an Apache Hadoop cluster.

Format of the Course

Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.

Course Customization Options

To request customized training for this course, please contact us.

Course Outline

Introduction

Introduction to Cloud Computing and Big Data solutions
Overview of Apache Hadoop Features and Architecture

Setting up Hadoop

Planning a Hadoop cluster (on-premises, cloud, etc.)
Selecting the OS and Hadoop distribution
Provisioning resources (hardware, network, etc.)
Downloading and installing the software
Sizing the cluster for flexibility
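
As a rough illustration of the sizing step, the sketch below estimates raw disk capacity and node count from the amount of data to be stored. The replication factor of 3 is the HDFS default; the 25% allowance for intermediate data, the 80% maximum disk fill, and the function names are assumptions chosen for this example, not figures from the course.

```python
import math


def raw_storage_tb(data_tb, replication=3, temp_overhead=0.25):
    """Raw disk needed: logical data times replication, plus headroom
    for intermediate (shuffle/temp) data. temp_overhead is an assumed value."""
    return data_tb * replication * (1 + temp_overhead)


def nodes_needed(data_tb, disk_per_node_tb, replication=3,
                 temp_overhead=0.25, max_fill=0.8):
    """Number of worker nodes, keeping each node's disks below max_fill."""
    raw = raw_storage_tb(data_tb, replication, temp_overhead)
    usable_per_node = disk_per_node_tb * max_fill
    return math.ceil(raw / usable_per_node)


if __name__ == "__main__":
    # Example: 100 TB of data on nodes with 48 TB of disk each.
    print(raw_storage_tb(100))
    print(nodes_needed(100, disk_per_node_tb=48))
```

A real sizing exercise would also account for data growth, compression, and CPU and memory requirements, which this sketch leaves out.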

Working with HDFS

Understanding the Hadoop Distributed File System (HDFS)
Overview of HDFS Command Reference
Accessing HDFS
Performing Basic File Operations on HDFS
Using S3 as a complement to HDFS
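
To make the HDFS storage model concrete, the following sketch computes how a single file is split into blocks and replicated. It assumes the default block size of 128 MB and the default replication factor of 3 used by recent Hadoop releases; the function name `block_layout` is hypothetical and only serves this example.

```python
MB = 1024 * 1024


def block_layout(file_size_bytes, block_size=128 * MB, replication=3):
    """Return how many blocks and block replicas a file occupies in HDFS.

    The last block is not padded, so raw usage is simply size * replication.
    """
    # Integer ceiling division avoids floating-point rounding.
    blocks = -(-file_size_bytes // block_size)
    return {
        "blocks": blocks,
        "replicas": blocks * replication,
        "raw_bytes": file_size_bytes * replication,
    }


if __name__ == "__main__":
    print(block_layout(1024 * MB))  # a 1 GB file
    print(block_layout(300 * MB))   # the last block is only partly filled
```

The block count matters to administrators because the NameNode keeps metadata for every block in memory, so many small files cost more than a few large ones.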

Overview of MapReduce

Understanding Data Flow in the MapReduce Framework
Map, Shuffle, Sort and Reduce
Demo: Computing Top Salaries
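
The map, shuffle, sort and reduce stages can be sketched in plain Python as a single-process simulation. This is not Hadoop code and not the course's actual demo; the sample records and function names are invented for illustration. It finds the top salary in each department.

```python
from collections import defaultdict

# Hypothetical input records: (employee, department, salary)
RECORDS = [
    ("alice", "eng", 120),
    ("bob", "eng", 135),
    ("carol", "ops", 90),
    ("dave", "ops", 95),
    ("erin", "eng", 110),
]


def map_phase(record):
    """Map: emit (key, value) pairs, here department -> (salary, name)."""
    name, dept, salary = record
    yield dept, (salary, name)


def shuffle_and_sort(pairs):
    """Shuffle: group values by key. Sort: order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    salary, name = max(values)
    return key, name, salary


def top_salaries(records):
    mapped = (pair for rec in records for pair in map_phase(rec))
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    for dept, name, salary in top_salaries(RECORDS):
        print(dept, name, salary)
```

On a real cluster the map and reduce functions run in parallel on many nodes, and the framework performs the shuffle and sort over the network between them.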

Working with YARN

Understanding resource management in Hadoop
Working with ResourceManager, NodeManager, Application Master
Scheduling jobs under YARN
Scheduling for large numbers of nodes and clusters
Demo: Job scheduling

Integrating Hadoop with Spark

Setting up storage for Spark (HDFS, Amazon S3, NoSQL, etc.)
Understanding Resilient Distributed Datasets (RDDs)
Creating an RDD
Implementing RDD Transformations
Demo: Implementing a Text Search Program for Movie Titles
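
The idea behind RDD transformations, lazy steps that only execute when an action asks for a result, can be illustrated with a plain-Python analogy using generators. This is not Spark code and not the course's demo; the title list and the function name `search_titles` are made up for the example. In PySpark the corresponding steps would typically be `SparkContext.parallelize` to create the RDD, `RDD.map` and `RDD.filter` as transformations, and `RDD.collect` as the action.

```python
# Hypothetical sample data standing in for a movie-title dataset.
TITLES = [
    "The Matrix",
    "Toy Story",
    "The Godfather",
    "Matrix Reloaded",
    "Spirited Away",
]


def search_titles(titles, keyword):
    """Case-insensitive substring search over a collection of titles."""
    needle = keyword.lower()
    # "Transformations": generator expressions are lazy, so nothing runs yet.
    normalized = ((title, title.lower()) for title in titles)
    matches = (title for title, lowered in normalized if needle in lowered)
    # "Action": sorting consumes the generators and produces the result.
    return sorted(matches)


if __name__ == "__main__":
    print(search_titles(TITLES, "matrix"))
```

Unlike this single-process analogy, an RDD is partitioned across the cluster and can be recomputed from its lineage if a partition is lost.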

Managing a Hadoop Cluster

Monitoring Hadoop
Securing a Hadoop cluster
Adding and removing nodes
Running a performance benchmark
Tuning a Hadoop cluster to optimize performance
Backup, recovery and business continuity planning
Ensuring high availability (HA)

Upgrading and Migrating a Hadoop Cluster

Assessing workload requirements
Upgrading Hadoop
Moving from on-premises to cloud and vice versa
Recovering from failures

Troubleshooting

Summary and Conclusion
