Apache Ambari: Efficiently Manage Hadoop Clusters Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • Linux experience
  • Knowledge of database concepts and practices
  • Knowledge of Hadoop infrastructure and practices

Overview

Apache Ambari is an open-source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.

In this instructor-led, live training, participants will learn the management tools and practices provided by Ambari to successfully manage Hadoop clusters.

By the end of this training, participants will be able to:

  • Set up a live Big Data cluster using Ambari
  • Apply Ambari’s advanced features and functionalities to various use cases
  • Seamlessly add and remove nodes as needed
  • Improve a Hadoop cluster’s performance through tuning and tweaking

Audience

  • DevOps
  • System Administrators
  • DBAs
  • Hadoop testing professionals

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

To request a customized course outline for this training, please contact us.

Moving Data from MySQL to Hadoop with Sqoop Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • An understanding of big data concepts (HDFS, Hive, etc.)
  • An understanding of relational databases (MySQL, etc.)
  • Experience with the Linux command line

Overview

Sqoop is an open source software tool for transferring data between Hadoop and relational databases or mainframes. It can be used to import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS). The data can then be transformed in Hadoop MapReduce and exported back into an RDBMS.

In this instructor-led, live training, participants will learn how to use Sqoop to import data from a traditional relational database into Hadoop storage such as HDFS or Hive, and vice versa.
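
As a preview of the hands-on work, the sketch below drives a Sqoop import and export from Python via subprocess. It is a minimal sketch, not course material: the JDBC URL, credentials, table names, and HDFS paths are all placeholders.

    # Minimal sketch: scripting Sqoop from Python. Every connection
    # detail, table name, and path below is a placeholder (assumption).
    import subprocess

    MYSQL_URL = "jdbc:mysql://dbhost:3306/sales"   # placeholder database

    def sqoop_import(table, target_dir):
        # Import a MySQL table into HDFS as delimited text files.
        subprocess.run([
            "sqoop", "import",
            "--connect", MYSQL_URL,
            "--username", "etl_user",
            "--password-file", "/user/etl/.sqoop_pass",  # safer than --password
            "--table", table,
            "--target-dir", target_dir,
            "--num-mappers", "4",                        # parallel map tasks
        ], check=True)

    def sqoop_export(table, export_dir):
        # Export HDFS files (e.g. transformed output) back into MySQL.
        subprocess.run([
            "sqoop", "export",
            "--connect", MYSQL_URL,
            "--username", "etl_user",
            "--password-file", "/user/etl/.sqoop_pass",
            "--table", table,
            "--export-dir", export_dir,
        ], check=True)

    if __name__ == "__main__":
        sqoop_import("orders", "/user/etl/orders_raw")
        # ... transform in Hadoop (MapReduce, Hive) ...
        sqoop_export("orders_summary", "/user/etl/orders_summary")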

By the end of this training, participants will be able to:

  • Install and configure Sqoop
  • Import data from MySQL to HDFS and Hive
  • Import data from HDFS and Hive to MySQL

Audience

  • System administrators
  • Data engineers

Format of the Course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Note

  • To request a customized training for this course, please contact us to arrange it.

Course Outline

Introduction

  • Moving data from legacy data stores to Hadoop

Installing and Configuring Sqoop

Overview of Sqoop Features and Architecture

Importing Data from MySQL to HDFS

Importing Data from MySQL to Hive

Transforming Data in Hadoop

Importing Data from HDFS to MySQL

Importing Data from Hive to MySQL

Importing Incrementally with Sqoop Jobs

Troubleshooting

Summary and Conclusion

Hadoop with Python Training Course

Duration

28 hours (usually 4 days including breaks)

Requirements

  • Experience with Python programming
  • Basic familiarity with Hadoop

Overview

Hadoop is a popular Big Data processing framework. Python is a high-level programming language famous for its clear syntax and code readability.

In this instructor-led, live training, participants will learn how to work with Hadoop, MapReduce, Pig, and Spark using Python as they step through multiple examples and use cases.

By the end of this training, participants will be able to:

  • Understand the basic concepts behind Hadoop, MapReduce, Pig, and Spark
  • Use Python with Hadoop Distributed File System (HDFS), MapReduce, Pig, and Spark
  • Use Snakebite to programmatically access HDFS within Python
  • Use mrjob to write MapReduce jobs in Python
  • Write Spark programs with Python
  • Extend the functionality of Pig using Python UDFs
  • Manage MapReduce jobs and Pig scripts using Luigi

Audience

  • Developers
  • IT Professionals

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

Introduction

Understanding Hadoop’s Architecture and Key Concepts

Understanding the Hadoop Distributed File System (HDFS)

  • Overview of HDFS and its Architectural Design
  • Interacting with HDFS
  • Performing Basic File Operations on HDFS
  • Overview of HDFS Command Reference
  • Overview of Snakebite
  • Installing Snakebite
  • Using the Snakebite Client Library (see the sketch after this list)
  • Using the CLI Client
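
As a reference point for the Snakebite items above, here is a minimal sketch of the client library. Snakebite is a Python 2-era HDFS client from Spotify; the NameNode host, port, and paths below are placeholders, not course materials.

    # Minimal Snakebite sketch: talk to HDFS over the NameNode's RPC port.
    from snakebite.client import Client

    client = Client("namenode-host", 8020)  # placeholder NameNode address

    # ls() takes a list of paths and yields one dict per entry.
    for entry in client.ls(["/"]):
        print(entry["path"])

    # mkdir() and delete() return generators of result dicts;
    # they must be consumed for the operations to execute.
    list(client.mkdir(["/tmp/snakebite-demo"]))
    list(client.delete(["/tmp/snakebite-demo"], recurse=True))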

Learning the MapReduce Programming Model with Python

  • Overview of the MapReduce Programming Model
  • Understanding Data Flow in the MapReduce Framework
    • Map
    • Shuffle and Sort
    • Reduce
  • Using the Hadoop Streaming Utility
    • Understanding How the Hadoop Streaming Utility Works
    • Demo: Implementing the WordCount Application in Python
  • Using the mrjob Library
    • Overview of mrjob
    • Installing mrjob
    • Demo: Implementing the WordCount Algorithm Using mrjob (see the sketch after this list)
    • Understanding How a MapReduce Job Written with the mrjob Library Works
    • Executing a MapReduce Application with mrjob
    • Hands-on: Computing Top Salaries Using mrjob
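
The WordCount algorithm referenced above is the canonical first mrjob program. Here is a minimal sketch of it; the file name and input paths are illustrative, not course materials.

    # wordcount.py — a minimal mrjob WordCount sketch.
    # Run locally:    python wordcount.py input.txt
    # Run on Hadoop:  python wordcount.py -r hadoop hdfs:///path/to/input
    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Emit (word, 1) for every word on the input line.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Sum all counts emitted for the same word.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()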

Learning Pig with Python

  • Overview of Pig
  • Demo: Implementing the WordCount Algorithm in Pig
  • Configuring and Running Pig Scripts and Pig Statements
    • Using the Pig Execution Modes
    • Using the Pig Interactive Mode
    • Using the Pig Batch Mode
  • Understanding the Basic Concepts of the Pig Latin Language
    • Using Statements
    • Loading Data
    • Transforming Data
    • Storing Data
  • Extending Pig’s Functionality with Python UDFs
    • Registering a Python UDF File
    • Demo: A Simple Python UDF (see the sketch after this list)
    • Demo: String Manipulation Using Python UDF
    • Hands-on: Calculating the 10 Most Recent Movies Using Python UDF
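
To make the UDF items above concrete, here is a minimal sketch of a Python UDF file for Pig. The file name, schema, and relation names are illustrative; when Pig registers the file with its Jython engine it injects the outputSchema decorator, so the fallback below exists only so the module also imports standalone.

    # myudfs.py — a minimal Pig (Jython) UDF sketch.
    # Registered and called from Pig Latin roughly like this (illustrative):
    #   REGISTER 'myudfs.py' USING jython AS myfuncs;
    #   shouted = FOREACH movies GENERATE myfuncs.to_upper(title);

    try:
        outputSchema                 # provided by Pig's Jython engine
    except NameError:
        def outputSchema(schema):    # no-op stand-in for standalone runs
            def wrap(fn):
                return fn
            return wrap

    @outputSchema("upper_title:chararray")
    def to_upper(s):
        # Upper-case a chararray, passing nulls through untouched.
        if s is None:
            return None
        return s.upper()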

Using Spark and PySpark

  • Overview of Spark
  • Demo: Implementing the WordCount Algorithm in PySpark (see the sketch after this list)
  • Overview of PySpark
    • Using an Interactive Shell
    • Implementing Self-Contained Applications
  • Working with Resilient Distributed Datasets (RDDs)
    • Creating RDDs from a Python Collection
    • Creating RDDs from Files
    • Implementing RDD Transformations
    • Implementing RDD Actions
  • Hands-on: Implementing a Text Search Program for Movie Titles with PySpark
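
The PySpark WordCount demo above boils down to a few RDD transformations. This is a minimal sketch; the HDFS paths are placeholders, and the script would be launched with spark-submit (or typed into the pyspark shell without the SparkContext boilerplate).

    # wordcount_spark.py — a minimal PySpark WordCount sketch (RDD API).
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")

    counts = (
        sc.textFile("hdfs:///user/demo/input")       # placeholder input path
          .flatMap(lambda line: line.split())        # line -> words
          .map(lambda word: (word.lower(), 1))       # word -> (word, 1)
          .reduceByKey(lambda a, b: a + b)           # sum counts per word
    )

    counts.saveAsTextFile("hdfs:///user/demo/wordcount_out")  # placeholder
    sc.stop()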

Managing Workflow with Python

  • Overview of Apache Oozie and Luigi
  • Installing Luigi
  • Understanding Luigi Workflow Concepts
    • Tasks
    • Targets
    • Parameters
  • Demo: Examining a Workflow that Implements the WordCount Algorithm (see the sketch after this list)
  • Working with Hadoop Workflows that Control MapReduce and Pig Jobs
    • Using Luigi’s Configuration Files
    • Working with MapReduce in Luigi
    • Working with Pig in Luigi
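
A Luigi workflow is built from exactly the three concepts listed above: Tasks, Targets, and Parameters. The minimal sketch below shows one of each; the class, file names, and paths are illustrative, not course materials.

    # count_lines.py — a minimal Luigi sketch: one Task, one Target,
    # one Parameter. Run with:
    #   python count_lines.py CountLines --path input.txt --local-scheduler
    import luigi

    class CountLines(luigi.Task):
        path = luigi.Parameter()          # Parameter: which file to count

        def output(self):
            # Target: its existence tells Luigi the task is complete.
            return luigi.LocalTarget(self.path + ".count")

        def run(self):
            # Task body: count lines and write the result to the Target.
            with open(self.path) as f:
                n = sum(1 for _ in f)
            with self.output().open("w") as out:
                out.write("%d\n" % n)

    if __name__ == "__main__":
        luigi.run()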

Summary and Conclusion

Hadoop for Project Managers Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • A general understanding of programming
  • An understanding of databases
  • Basic knowledge of Linux

Overview

As more and more software and IT projects migrate from local processing and data management to distributed processing and big data storage, Project Managers are finding the need to upgrade their knowledge and skills to grasp the concepts and practices relevant to Big Data projects and opportunities.

This course introduces Project Managers to the most popular Big Data processing framework: Hadoop.  

In this instructor-led, live training, participants will learn the core components of the Hadoop ecosystem and how these technologies can be used to solve large-scale problems. By learning these foundations, participants will improve their ability to communicate with the developers and implementers of these systems, as well as with the data scientists and analysts that many IT projects involve.

Audience

  • Project Managers wishing to implement Hadoop into their existing development or IT infrastructure
  • Project Managers needing to communicate with cross-functional teams that include big data engineers, data scientists and business analysts

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

Introduction

  • Why and how project teams adopt Hadoop
  • How it all started
  • The Project Manager’s role in Hadoop projects

Understanding Hadoop’s Architecture and Key Concepts

  • HDFS
  • MapReduce
  • Other pieces of the Hadoop ecosystem

What Constitutes Big Data?

Different Approaches to Storing Big Data

HDFS (Hadoop Distributed File System) as the Foundation

How Big Data is Processed

  • The power of distributed processing

Processing Data with MapReduce

  • How data is picked apart step by step

The Role of Clustering in Large-Scale Distributed Processing

  • Architectural overview
  • Clustering approaches

Clustering Your Data and Processes with YARN

The Role of Non-Relational Databases in Big Data Storage

Working with Hadoop’s Non-Relational Database: HBase

Data Warehousing Architectural Overview

Managing Your Data Warehouse with Hive

Running Hadoop from Shell Scripts

Working with Hadoop Streaming

Other Hadoop Tools and Utilities

Getting Started on a Hadoop Project

  • Demystifying complexity

Migrating an Existing Project to Hadoop

  • Infrastructure considerations
  • Scaling beyond your allocated resources

Hadoop Project Stakeholders and Their Toolkits

  • Developers, data scientists, business analysts and project managers

Hadoop as a Foundation for New Technologies and Approaches

Closing Remarks

Hadoop for Developers and Administrators Training Course

Duration

21 hours (usually 3 days including breaks)

Overview

Hadoop is the most popular Big Data processing framework.

Course Outline

Module 1. Introduction to Hadoop

  • The Hadoop Distributed File System (HDFS)
  • The Read Path and The Write Path
  • Managing Filesystem Metadata
  • The Namenode and the Datanode
  • The Namenode High Availability
  • Namenode Federation
  • The Command-Line Tools
  • Understanding REST Support

Module 2. Introduction to MapReduce

  • Analyzing the Data with Hadoop
  • Map and Reduce Pattern
  • Java MapReduce
  • Scaling Out
  • Data Flow
  • Developing Combiner Functions
  • Running a Distributed MapReduce Job

Module 3. Planning a Hadoop Cluster

  • Picking a Distribution and Version of Hadoop
  • Versions and Features
  • Hardware Selection
  • Master and Worker Hardware Selection
  • Cluster Sizing
  • Operating System Selection and Preparation
  • Deployment Layout
  • Setting up Users, Groups, and Privileges
  • Disk Configuration
  • Network Design

Module 4. Installation and Configuration

  • Installing Hadoop
  • Configuration: An Overview
  • The Hadoop XML Configuration Files
  • Environment Variables and Shell Scripts
  • Logging Configuration
  • Managing HDFS
  • Optimization and Tuning
  • Formatting the Namenode
  • Creating a /tmp Directory
  • Thinking About Namenode High Availability
  • The Fencing Options
  • Automatic Failover Configuration
  • Format and Bootstrap the Namenodes
  • Namenode Federation

Module 5. Understanding Hadoop I/O

  • Data Integrity in HDFS  
  • Understanding Codecs
  • Compression and Input Splits
  • Using Compression in MapReduce
  • The Serialization mechanism
  • File-Based Data Structures
  • The SequenceFile format
  • Other File Formats and Column-Oriented Formats

Module 6. Developing a MapReduce Application

  • The Configuration API 
  • Setting Up the Development Environment
  • Managing Configuration
  • GenericOptionsParser, Tool, and ToolRunner
  • Writing a Unit Test with MRUnit
  • The Mapper and Reducer
  • Running Locally on Test Data 
  • Testing the Driver
  • Running on a Cluster
  • Packaging and Launching a Job
  • The MapReduce Web UI
  • Tuning a Job

Module 7. Identity, Authentication, and Authorization

  • Managing Identity
  • Kerberos and Hadoop
  • Understanding Authorization

Module 8. Resource Management

  • What Is Resource Management?
  • HDFS Quotas
  • MapReduce Schedulers
  • Anatomy of a YARN Application Run
  • Resource Requests
  • Application Lifespan
  • YARN Compared to MapReduce 1
  • Scheduling in YARN
  • Scheduler Options
  • Capacity Scheduler Configuration
  • Fair Scheduler Configuration
  • Delay Scheduling
  • Dominant Resource Fairness

Module 9. MapReduce Types and Formats

  • MapReduce Types
  • The Default MapReduce Job
  • Defining the Input Formats
  • Managing Input Splits and Records
  • Text Input and Binary Input
  • Managing Multiple Inputs
  • Database Input (and Output)
  • Output Formats
  • Text Output and Binary Output
  • Managing Multiple Outputs
  • The Database Output

Module 10. Using MapReduce Features

  • Using Counters
  • Reading Built-in Counters
  • User-Defined Java Counters
  • Understanding Sorting
  • Using the Distributed Cache

Module 11. Cluster Maintenance and Troubleshooting

  • Managing Hadoop Processes
  • Starting and Stopping Processes with Init Scripts
  • Starting and Stopping Processes Manually
  • HDFS Maintenance Tasks
  • Adding a Datanode
  • Decommissioning a Datanode
  • Checking Filesystem Integrity with fsck
  • Balancing HDFS Block Data
  • Dealing with a Failed Disk
  • MapReduce Maintenance Tasks 
  • Killing a MapReduce Job
  • Killing a MapReduce Task
  • Managing Resource Exhaustion

Module 12. Monitoring

  • Available Hadoop Metrics
  • The Role of SNMP
  • Health Monitoring
  • Host-Level Checks
  • HDFS Checks
  • MapReduce Checks

Module 13. Backup and Recovery

  • Data Backup
  • Distributed Copy (distcp)
  • Parallel Data Ingestion
  • Namenode Metadata

Hadoop for Business Analysts Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • programming background with databases / SQL
  • basic knowledge of Linux (be able to navigate the Linux command line, edit files using vi / nano)

Lab environment

Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.

Students will need the following

  • an SSH client (Linux and Mac already have SSH clients; for Windows, PuTTY is recommended)
  • a browser to access the cluster (we recommend Firefox)

Overview

Apache Hadoop is the most popular framework for processing Big Data. Hadoop provides rich and deep analytics capability, and it is making inroads into the traditional BI analytics world. This course will introduce an analyst to the core components of the Hadoop ecosystem and its analytics capabilities.

Audience

Business Analysts

Format

Lectures and hands-on labs.

Course Outline

  • Section 1: Introduction to Hadoop
    • Hadoop history, concepts
    • ecosystem
    • distributions
    • high-level architecture
    • Hadoop myths
    • Hadoop challenges
    • hardware / software
    • Labs: first look at Hadoop
  • Section 2: HDFS Overview
    • concepts (horizontal scaling, replication, data locality, rack awareness)
    • architecture (Namenode, Secondary Namenode, Datanode)
    • data integrity
    • future of HDFS: Namenode HA, Federation
    • Labs: interacting with HDFS
  • Section 3: MapReduce Overview
    • MapReduce concepts
    • daemons: jobtracker / tasktracker
    • phases: driver, mapper, shuffle/sort, reducer
    • thinking in MapReduce
    • future of MapReduce (YARN)
    • Labs: running a MapReduce program
  • Section 4: Pig
    • Pig vs Java MapReduce
    • the Pig Latin language
    • user-defined functions
    • understanding Pig job flow
    • basic data analysis with Pig
    • complex data analysis with Pig
    • multiple datasets with Pig
    • advanced concepts
    • Lab: writing Pig scripts to analyze / transform data
  • Section 5: Hive
    • Hive concepts
    • architecture
    • SQL support in Hive
    • data types
    • table creation and queries
    • Hive data management
    • partitions & joins
    • text analytics
    • Labs (multiple): creating Hive tables and running queries, joins, using partitions, using text analytics functions
  • Section 6: BI Tools for Hadoop
    • BI tools and Hadoop
    • Overview of current BI tools landscape
    • Choosing the best tool for the job

Hadoop for Administrators Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • comfortable with basic Linux system administration
  • basic scripting skills

Knowledge of Hadoop and Distributed Computing is not required, but will be introduced and explained in the course.

Lab environment

Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.

Students will need the following

  • an SSH client (Linux and Mac already have SSH clients; for Windows, PuTTY is recommended)
  • a browser to access the cluster (we recommend Firefox)

Overview

Apache Hadoop is the most popular framework for processing Big Data on clusters of servers. In this three-day (optionally four-day) course, attendees will learn about the business benefits and use cases for Hadoop and its ecosystem, how to plan cluster deployment and growth, and how to install, maintain, monitor, troubleshoot and optimize Hadoop. They will also practice bulk data loading into the cluster, get familiar with various Hadoop distributions, and practice installing and managing Hadoop ecosystem tools. The course finishes with a discussion of securing the cluster with Kerberos.

“…The materials were very well prepared and covered thoroughly. The Lab was very helpful and well organized”
— Andrew Nguyen, Principal Integration DW Engineer, Microsoft Online Advertising

Audience

Hadoop administrators

Format

Lectures and hands-on labs; approximate balance: 60% lectures, 40% labs.

Course Outline

  • Introduction
    • Hadoop history, concepts
    • Ecosystem
    • Distributions
    • High level architecture
    • Hadoop myths
    • Hadoop challenges (hardware / software)
    • Labs: discuss your Big Data projects and problems
  • Planning and installation
    • Selecting software, Hadoop distributions
    • Sizing the cluster, planning for growth
    • Selecting hardware and network
    • Rack topology
    • Installation
    • Multi-tenancy
    • Directory structure, logs
    • Benchmarking
    • Labs: cluster install, run performance benchmarks
  • HDFS operations
    • Concepts (horizontal scaling, replication, data locality, rack awareness)
    • Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
    • Health monitoring
    • Command-line and browser-based administration
    • Adding storage, replacing defective drives
    • Labs: getting familiar with HDFS command lines
  • Data ingestion
    • Flume for logs and other data ingestion into HDFS
    • Sqoop for importing from SQL databases to HDFS, as well as exporting back to SQL
    • Hadoop data warehousing with Hive
    • Copying data between clusters (distcp)
    • Using S3 as complementary to HDFS
    • Data ingestion best practices and architectures
    • Labs: setting up and using Flume, the same for Sqoop
  • MapReduce operations and administration
    • Parallel computing before MapReduce: comparing HPC vs Hadoop administration
    • MapReduce cluster loads
    • Nodes and Daemons (JobTracker, TaskTracker)
    • MapReduce UI walk through
    • MapReduce configuration
    • Job config
    • Optimizing MapReduce
    • Fool-proofing MR: what to tell your programmers
    • Labs: running MapReduce examples
  • YARN: new architecture and new capabilities
    • YARN design goals and implementation architecture
    • New actors: ResourceManager, NodeManager, Application Master
    • Installing YARN
    • Job scheduling under YARN
    • Labs: investigate job scheduling
  • Advanced topics
    • Hardware monitoring
    • Cluster monitoring
    • Adding and removing servers, upgrading Hadoop
    • Backup, recovery and business continuity planning
    • Oozie job workflows
    • Hadoop high availability (HA)
    • Hadoop Federation
    • Securing your cluster with Kerberos
    • Labs: set up monitoring
  • Optional tracks
    • Cloudera Manager for cluster administration, monitoring, and routine tasks; installation, use. In this track, all exercises and labs are performed within the Cloudera distribution environment (CDH5)
    • Ambari for cluster administration, monitoring, and routine tasks; installation, use. In this track, all exercises and labs are performed within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0)

Advanced Hadoop for Developers Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • comfortable with the Java programming language (most programming exercises are in Java)
  • comfortable in a Linux environment (able to navigate the Linux command line, edit files using vi / nano)
  • a working knowledge of Hadoop

Lab environment

Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.

Students will need the following

  • an SSH client (Linux and Mac already have SSH clients; for Windows, PuTTY is recommended)
  • a browser to access the cluster (we recommend Firefox)

Overview

Apache Hadoop is one of the most popular frameworks for processing Big Data on clusters of servers. This course delves into data management in HDFS, advanced Pig, Hive, and HBase.  These advanced programming techniques will be beneficial to experienced Hadoop developers.

Audience: developers

Format: lectures (50%) and hands-on labs (50%).

Course Outline

Section 1: Data Management in HDFS

  • Various Data Formats (JSON / Avro / Parquet)
  • Compression Schemes
  • Data Masking
  • Labs: analyzing different data formats; enabling compression

Section 2: Advanced Pig

  • User-defined Functions
  • Introduction to Pig Libraries (ElephantBird / DataFu)
  • Loading Complex Structured Data using Pig
  • Pig Tuning
  • Labs: advanced Pig scripting, parsing complex data types

Section 3 : Advanced Hive

  • User-defined Functions
  • Compressed Tables
  • Hive Performance Tuning
  • Labs: creating compressed tables, evaluating table formats and configuration

Section 4 : Advanced HBase

  • Advanced Schema Modelling
  • Compression
  • Bulk Data Ingest
  • Wide-table / Tall-table comparison
  • HBase and Pig
  • HBase and Hive
  • HBase Performance Tuning
  • Labs: tuning HBase; accessing HBase data from Pig & Hive; using Phoenix for data modeling

Hadoop for Developers (4 days) Training Course

Duration

28 hours (usually 4 days including breaks)

Requirements

  • comfortable with the Java programming language (most programming exercises are in Java)
  • comfortable in a Linux environment (able to navigate the Linux command line, edit files using vi / nano)

Lab environment

Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.

Students will need the following

  • an SSH client (Linux and Mac already have SSH clients; for Windows, PuTTY is recommended)
  • a browser to access the cluster (we recommend Firefox)

Overview

Apache Hadoop is the most popular framework for processing Big Data on clusters of servers. This course will introduce a developer to the various components (HDFS, MapReduce, Pig, Hive and HBase) of the Hadoop ecosystem.

Course Outline

Section 1: Introduction to Hadoop

  • Hadoop history, concepts
  • ecosystem
  • distributions
  • high-level architecture
  • Hadoop myths
  • Hadoop challenges
  • hardware / software
  • Lab: first look at Hadoop

Section 2: HDFS

  • Design and architecture
  • concepts (horizontal scaling, replication, data locality, rack awareness)
  • daemons: Namenode, Secondary Namenode, Datanode
  • communications / heartbeats
  • data integrity
  • read / write path
  • Namenode High Availability (HA), Federation
  • Labs: interacting with HDFS

Section 3: MapReduce

  • concepts and architecture
  • daemons (MRv1): jobtracker / tasktracker
  • phases: driver, mapper, shuffle/sort, reducer
  • MapReduce Version 1 and Version 2 (YARN)
  • Internals of MapReduce
  • Introduction to the Java MapReduce program
  • Labs: running a sample MapReduce program

Section 4: Pig

  • Pig vs Java MapReduce
  • Pig job flow
  • the Pig Latin language
  • ETL with Pig
  • Transformations & Joins
  • User-defined functions (UDF)
  • Labs: writing Pig scripts to analyze data

Section 5: Hive

  • architecture and design
  • data types
  • SQL support in Hive
  • Creating Hive tables and querying
  • partitions
  • joins
  • text processing
  • Labs: various labs on processing data with Hive

Section 6: HBase

  • concepts and architecture
  • HBase vs RDBMS vs Cassandra
  • HBase Java API
  • Time series data on HBase
  • schema design
  • Labs: interacting with HBase using the shell; programming with the HBase Java API; schema design exercise

Hadoop and Spark for Administrators Training Course

Duration

35 hours (usually 5 days including breaks)

Requirements

  • System administration experience
  • Experience with Linux command line
  • An understanding of big data concepts

Audience

  • System administrators
  • DBAs

Overview

Apache Hadoop is a popular data processing framework for processing large data sets across many computers.

This instructor-led, live training (online or onsite) is aimed at system administrators who wish to learn how to set up, deploy and manage Hadoop clusters within their organization.

By the end of this training, participants will be able to:

  • Install and configure Apache Hadoop.
  • Understand the four major components in the Hadoop ecosystem: HDFS, MapReduce, YARN, and Hadoop Common.
  • Use the Hadoop Distributed File System (HDFS) to scale a cluster to hundreds or thousands of nodes.
  • Set up HDFS to operate as the storage engine for on-premise Spark deployments.
  • Set up Spark to access alternative storage solutions such as Amazon S3 and NoSQL database systems such as Redis, Elasticsearch, Couchbase, Aerospike, etc.
  • Carry out administrative tasks such as provisioning, management, monitoring and securing an Apache Hadoop cluster.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange it.

Course Outline

Introduction

  • Introduction to Cloud Computing and Big Data solutions
  • Overview of Apache Hadoop Features and Architecture

Setting up Hadoop

  • Planning a Hadoop cluster (on-premise, cloud, etc.)
  • Selecting the OS and Hadoop distribution
  • Provisioning resources (hardware, network, etc.)
  • Downloading and installing the software
  • Sizing the cluster for flexibility

Working with HDFS

  • Understanding the Hadoop Distributed File System (HDFS)
  • Overview of HDFS Command Reference
  • Accessing HDFS
  • Performing Basic File Operations on HDFS
  • Using S3 as a complement to HDFS

Overview of MapReduce

  • Understanding Data Flow in the MapReduce Framework
  • Map, Shuffle, Sort and Reduce
  • Demo: Computing Top Salaries

Working with YARN

  • Understanding resource management in Hadoop
  • Working with ResourceManager, NodeManager, Application Master
  • Scheduling jobs under YARN
  • Scheduling for large numbers of nodes and clusters
  • Demo: Job scheduling

Integrating Hadoop with Spark

  • Setting up storage for Spark (HDFS, Amazon S3, NoSQL, etc.)
  • Understanding Resilient Distributed Datasets (RDDs)
  • Creating an RDD
  • Implementing RDD Transformations
  • Demo: Implementing a Text Search Program for Movie Titles
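
The text search demo above takes only a few lines in PySpark. This is a minimal sketch under assumed inputs: the HDFS path and the search term are placeholders.

    # title_search.py — a minimal PySpark text search sketch.
    from pyspark import SparkContext

    sc = SparkContext(appName="TitleSearch")

    titles = sc.textFile("hdfs:///data/movie_titles.txt")    # placeholder path
    matches = titles.filter(lambda t: "star" in t.lower())   # assumed search term

    for title in matches.take(10):   # print the first 10 matches
        print(title)

    sc.stop()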

Managing a Hadoop Cluster

  • Monitoring Hadoop
  • Securing a Hadoop cluster
  • Adding and removing nodes
  • Running a performance benchmark
  • Tuning a Hadoop cluster to optimize performance
  • Backup, recovery and business continuity planning
  • Ensuring high availability (HA)

Upgrading and Migrating a Hadoop Cluster

  • Assessing workload requirements
  • Upgrading Hadoop
  • Moving from on-premise to the cloud and vice versa
  • Recovering from failures

Troubleshooting

Summary and Conclusion