Advanced Hadoop for Developers Training Course – Bluechip AI Asia, AI Development Company

Duration

21 hours (usually 3 days including breaks)

Requirements

comfortable with the Java programming language (most programming exercises are in Java)
comfortable in a Linux environment (able to navigate the Linux command line and edit files using vi or nano)
a working knowledge of Hadoop

Lab environment

Zero Install: There is no need to install Hadoop software on students’ machines. A working Hadoop cluster will be provided for students.

Students will need the following:

an SSH client (Linux and macOS already include one; for Windows, PuTTY is recommended)
a browser to access the cluster (we recommend Firefox)

Overview

Apache Hadoop is one of the most popular frameworks for processing Big Data on clusters of servers. This course delves into data management in HDFS, advanced Pig, Hive, and HBase. These advanced programming techniques will be beneficial to experienced Hadoop developers.

Audience: developers

Duration: three days

Format: lectures (50%) and hands-on labs (50%).

Course Outline

Section 1: Data Management in HDFS

Various Data Formats (JSON / Avro / Parquet)
Compression Schemes
Data Masking
Labs: analyzing different data formats; enabling compression

Section 2: Advanced Pig

User-defined Functions
Introduction to Pig Libraries (Elephant Bird / DataFu)
Loading Complex Structured Data using Pig
Pig Tuning
Labs: advanced Pig scripting; parsing complex data types

Section 3: Advanced Hive

User-defined Functions
Compressed Tables
Hive Performance Tuning
Labs: creating compressed tables; evaluating table formats and configuration

Section 4: Advanced HBase

Advanced Schema Modelling
Compression
Bulk Data Ingest
Wide-table / Tall-table comparison
HBase and Pig
HBase and Hive
HBase Performance Tuning
Labs: tuning HBase; accessing HBase data from Pig and Hive; using Phoenix for data modeling