Duration
21 hours (usually 3 days including breaks)
Requirements
- Knowledge of SQL
Overview
Cloudera Impala is an open-source, massively parallel processing (MPP) SQL query engine for Apache Hadoop clusters.
Impala enables users to run low-latency SQL queries against data stored in the Hadoop Distributed File System (HDFS) and Apache HBase without requiring data movement or transformation.
Audience
This course is aimed at analysts and data scientists who analyze data stored in Hadoop using Business Intelligence or SQL tools.
After this course, delegates will be able to:
- Extract meaningful information from Hadoop clusters with Impala.
- Write programs in the Impala SQL dialect to support Business Intelligence.
- Troubleshoot Impala.
Course Outline
Introduction to Impala
- What is Impala?
- How Impala Differs from Relational Databases
- Limitations and Future Directions
- Using the Impala Shell
- The Impala Daemon, Statestore, and Catalog Service
Loading Impala
- Explore a New Impala Instance
- Load CSV Data from Local Files
- Point an Impala Table at Existing Data Files
Analyzing Data with Impala
- Describe the Impala Table
- Basic Syntax and Querying
- Data Types
- Filtering, Sorting, and Limiting Results
- Joining and Grouping Data
- Data Loading and Querying Examples
- Improving Impala Performance
- How Impala Works with Hadoop File Formats
- Hands-On Exercise: Interactive Analysis with Impala
Programming Impala Applications
- Overview of the Impala SQL Dialect
- Overview of Impala Programming Interfaces
Troubleshooting Impala
- Troubleshooting Impala SQL Syntax Issues
- Troubleshooting I/O Capacity Problems
- Impala Web User Interface for Debugging