Hadoop for Developers (4 days) Training Course – Bluechip AI Asia, AI Development Company

Duration

28 hours (usually 4 days including breaks)

Requirements

comfortable with the Java programming language (most programming exercises are in Java)
comfortable in a Linux environment (able to navigate the Linux command line and edit files using vi / nano)

Lab environment

Zero Install : There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.

Students will need the following

an SSH client (Linux and Mac already have SSH clients; for Windows, PuTTY is recommended)
a browser to access the cluster (we recommend Firefox)

Overview

Apache Hadoop is one of the most popular frameworks for processing Big Data on clusters of servers. This course introduces developers to the main components of the Hadoop ecosystem: HDFS, MapReduce, Pig, Hive and HBase.

Course Outline

Section 1: Introduction to Hadoop

Hadoop history, concepts
ecosystem
distributions
high level architecture
Hadoop myths
Hadoop challenges
hardware / software
lab : first look at Hadoop

Section 2: HDFS

Design and architecture
concepts (horizontal scaling, replication, data locality, rack awareness)
Daemons : NameNode, Secondary NameNode, DataNode
communications / heart-beats
data integrity
read / write path
NameNode High Availability (HA), Federation
labs : Interacting with HDFS

Section 3: MapReduce

concepts and architecture
daemons (MRv1) : JobTracker / TaskTracker
phases : driver, mapper, shuffle/sort, reducer
MapReduce Version 1 and Version 2 (YARN)
Internals of MapReduce
Introduction to Java MapReduce programs
labs : Running a sample MapReduce program
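
To make the phases listed above (driver, mapper, shuffle/sort, reducer) concrete, here is a minimal word-count sketch in plain Java. It deliberately does not use the Hadoop API: the class name `WordCountPhases` and its methods are hypothetical, chosen only to mirror the phase names, and everything runs in a single JVM rather than on a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the MapReduce phases. This is NOT the Hadoop API;
// all names here are hypothetical and exist only for this sketch.
class WordCountPhases {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle/sort phase: group all values by key; the TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: wires the phases together (a real cluster would distribute this work).
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(mapped).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {data=2, hadoop=2, processes=1, stores=1}
        System.out.println(run(List.of("hadoop stores data", "hadoop processes data")));
    }
}
```

In an actual Hadoop job the map and reduce steps would run as separate tasks on different nodes, and the framework would perform the shuffle/sort between them; the sketch only shows how data flows from one phase to the next.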

Section 4: Pig

Pig vs Java MapReduce
Pig job flow
Pig Latin language
ETL with Pig
Transformations & Joins
User defined functions (UDF)
labs : writing Pig scripts to analyze data

Section 5: Hive

architecture and design
data types
SQL support in Hive
Creating Hive tables and querying
partitions
joins
text processing
labs : various labs on processing data with Hive

Section 6: HBase

concepts and architecture
HBase vs RDBMS vs Cassandra
HBase Java API
Time series data on HBase
schema design
labs : interacting with HBase using the shell; programming with the HBase Java API; schema design exercise
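
As a taste of the time series and schema design topics above, the following sketch shows one commonly used row-key layout: a metric name followed by a reversed timestamp, so that the newest reading for a metric sorts first. It is plain Java with no HBase dependency. The class `TimeSeriesRowKey`, the `rowKey` helper, the zero-byte separator and the metric name `cpu.load` are all assumptions made for illustration, not part of the course material or the HBase API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; it does not use the HBase client API.
class TimeSeriesRowKey {

    // Key layout (an assumption for this sketch):
    //   metric name bytes | 0x00 separator | 8-byte big-endian (Long.MAX_VALUE - timestamp)
    static byte[] rowKey(String metric, long timestampMillis) {
        byte[] metricBytes = metric.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(metricBytes.length + 1 + Long.BYTES)
                .put(metricBytes)
                .put((byte) 0)
                .putLong(Long.MAX_VALUE - timestampMillis)
                .array();
    }

    public static void main(String[] args) {
        byte[] older = rowKey("cpu.load", 1_000L);
        byte[] newer = rowKey("cpu.load", 2_000L);
        // HBase orders rows by comparing key bytes lexicographically, which
        // Arrays.compareUnsigned mimics here: the newer reading sorts first.
        System.out.println(Arrays.compareUnsigned(newer, older) < 0); // prints true
    }
}
```

With this layout, a scan that starts at a metric's prefix returns its most recent readings first. Whether that is the right trade-off depends on the access pattern, which is the kind of question the schema design exercise explores.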