Duration
21 hours (usually 3 days including breaks)
Overview
Hadoop is a widely used open-source framework for distributed storage and processing of large data sets.
Course Outline
Module 1. Introduction to Hadoop
- The Hadoop Distributed File System (HDFS)
- The Read Path and the Write Path
- Managing Filesystem Metadata
- The Namenode and the Datanode
- Namenode High Availability
- Namenode Federation
- The Command-Line Tools
- Understanding REST Support
Module 2. Introduction to MapReduce
- Analyzing the Data with Hadoop
- Map and Reduce Pattern
- Java MapReduce
- Scaling Out
- Data Flow
- Developing Combiner Functions
- Running a Distributed MapReduce Job
Module 3. Planning a Hadoop Cluster
- Picking a Distribution and Version of Hadoop
- Versions and Features
- Hardware Selection
- Master and Worker Hardware Selection
- Cluster Sizing
- Operating System Selection and Preparation
- Deployment Layout
- Setting up Users, Groups, and Privileges
- Disk Configuration
- Network Design
Module 4. Installation and Configuration
- Installing Hadoop
- Configuration: An Overview
- The Hadoop XML Configuration Files
- Environment Variables and Shell Scripts
- Logging Configuration
- Managing HDFS
- Optimization and Tuning
- Formatting the Namenode
- Creating a /tmp Directory
- Planning for Namenode High Availability
- The Fencing Options
- Automatic Failover Configuration
- Format and Bootstrap the Namenodes
- Namenode Federation
Module 5. Understanding Hadoop I/O
- Data Integrity in HDFS
- Understanding Codecs
- Compression and Input Splits
- Using Compression in MapReduce
- The Serialization Mechanism
- File-Based Data Structures
- The SequenceFile Format
- Other File Formats and Column-Oriented Formats
Module 6. Developing a MapReduce Application
- The Configuration API
- Setting Up the Development Environment
- Managing Configuration
- GenericOptionsParser, Tool, and ToolRunner
- Writing a Unit Test with MRUnit
- The Mapper and Reducer
- Running Locally on Test Data
- Testing the Driver
- Running on a Cluster
- Packaging and Launching a Job
- The MapReduce Web UI
- Tuning a Job
Module 7. Identity, Authentication, and Authorization
- Managing Identity
- Kerberos and Hadoop
- Understanding Authorization
Module 8. Resource Management
- What Is Resource Management?
- HDFS Quotas
- MapReduce Schedulers
- Anatomy of a YARN Application Run
- Resource Requests
- Application Lifespan
- YARN Compared to MapReduce 1
- Scheduling in YARN
- Scheduler Options
- Capacity Scheduler Configuration
- Fair Scheduler Configuration
- Delay Scheduling
- Dominant Resource Fairness
Module 9. MapReduce Types and Formats
- MapReduce Types
- The Default MapReduce Job
- Defining the Input Formats
- Managing Input Splits and Records
- Text Input and Binary Input
- Managing Multiple Inputs
- Database Input (and Output)
- Output Formats
- Text Output and Binary Output
- Managing Multiple Outputs
- The Database Output
Module 10. Using MapReduce Features
- Using Counters
- Reading Built-in Counters
- User-Defined Java Counters
- Understanding Sorting
- Using the Distributed Cache
Module 11. Cluster Maintenance and Troubleshooting
- Managing Hadoop Processes
- Starting and Stopping Processes with Init Scripts
- Starting and Stopping Processes Manually
- HDFS Maintenance Tasks
- Adding a Datanode
- Decommissioning a Datanode
- Checking Filesystem Integrity with fsck
- Balancing HDFS Block Data
- Dealing with a Failed Disk
- MapReduce Maintenance Tasks
- Killing a MapReduce Job
- Killing a MapReduce Task
- Managing Resource Exhaustion
Module 12. Monitoring
- Available Hadoop Metrics
- The Role of SNMP
- Health Monitoring
- Host-Level Checks
- HDFS Checks
- MapReduce Checks
Module 13. Backup and Recovery
- Data Backup
- Distributed Copy (distcp)
- Parallel Data Ingestion
- Namenode Metadata