Apache Arrow for Data Analysis across Disparate Data Sources Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • A basic understanding of SQL
  • Familiarity with Python or R
  • Some familiarity with Apache Spark

Overview

Apache Arrow is an open-source, in-memory columnar data format and set of libraries for analytical data processing. It is often used together with other data science tools to access disparate data stores for analysis, and it integrates well with technologies such as GPU databases, machine learning libraries and tools, execution engines, and data visualization frameworks.

In this onsite, instructor-led live training, participants will learn how to integrate Apache Arrow with various data science frameworks to access data from disparate data sources.

By the end of this training, participants will be able to:

  • Install and configure Apache Arrow in a distributed clustered environment
  • Use Apache Arrow to access data from disparate data sources
  • Use Apache Arrow to avoid the need to build and maintain complex ETL pipelines
  • Analyze data across disparate data sources without having to consolidate it into a centralized repository

Audience

  • Data scientists
  • Data engineers

Format of the Course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Note

  • To request a customized training for this course, please contact us to arrange it.

Course Outline

Introduction

  • Apache Arrow vs Parquet
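
To illustrate the distinction covered in this section, here is a minimal sketch (using the pyarrow library; the file name is illustrative) that round-trips data between the two: Arrow is the in-memory columnar representation, while Parquet is the compressed columnar format on disk.

```python
# Minimal sketch: Arrow as the in-memory format, Parquet as the on-disk format.
# The file name "example.parquet" is illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

# Arrow: a columnar table held in memory
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Parquet: a compressed, columnar file on disk
pq.write_table(table, "example.parquet")

# Reading the Parquet file back yields an Arrow table in memory again
restored = pq.read_table("example.parquet")
print(restored.schema)
```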

Installing and Configuring Apache Arrow
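
For the Python bindings, installation is typically a single "pip install pyarrow". The short sketch below simply verifies the installation and shows one basic configuration knob (the size of Arrow's CPU thread pool); cluster-specific setup is beyond this snippet.

```python
# Minimal sketch: verify the pyarrow installation and adjust the default
# thread pool Arrow uses for parallel work.
import pyarrow as pa

print(pa.__version__)   # installed Arrow version
print(pa.cpu_count())   # threads Arrow will use by default
pa.set_cpu_count(4)     # cap Arrow's CPU thread pool at 4 threads
```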

Overview of Apache Arrow Features and Architecture
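
The sketch below touches the core building blocks this section covers: typed arrays, a schema, and a record batch that stores columns in contiguous, columnar memory. Column names and values are illustrative.

```python
# Minimal sketch of Arrow's building blocks: arrays, a schema, a record batch.
import pyarrow as pa

schema = pa.schema([("name", pa.string()), ("score", pa.float64())])

# A record batch groups same-length columns under a schema
batch = pa.record_batch(
    [pa.array(["a", "b", "c"]), pa.array([1.0, 2.0, 3.0])],
    schema=schema,
)
print(batch.schema)
print(batch.num_rows)
```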

Exploring Data with Pandas and Apache Arrow
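
A minimal sketch of the pandas integration: converting a DataFrame to an Arrow table and back. The sample data is illustrative.

```python
# Minimal sketch: move data between pandas and Arrow.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"city": ["Berlin", "Tokyo"], "population": [3.6, 13.9]})

# pandas DataFrame -> Arrow Table (zero-copy where the types allow it)
table = pa.Table.from_pandas(df)

# Arrow Table -> pandas DataFrame
round_tripped = table.to_pandas()
print(round_tripped.dtypes)
```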

Exploring Data with Spark and Apache Arrow
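
A minimal sketch of the Spark integration: enabling Arrow-based columnar transfer so that toPandas() avoids row-by-row serialization. The configuration key shown is the one used by PySpark 3.x; older releases use "spark.sql.execution.arrow.enabled".

```python
# Minimal sketch: enable Arrow for Spark <-> pandas conversion in PySpark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("arrow-demo")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

sdf = spark.range(1_000_000)

# With Arrow enabled, toPandas() transfers data as columnar batches
pdf = sdf.toPandas()
print(len(pdf))
```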

Exploring Data with R and Apache Arrow

Exploring Data with MapD and Apache Arrow

Other Data Analysis Integrations

  • PySpark, Parquet files on S3, Oracle tables, and Elasticsearch indices
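
As a taste of one of these integrations, the sketch below reads a Parquet dataset directly from S3 into Arrow memory. The bucket, path, column names, and region are placeholders, and AWS credentials are assumed to be available in the environment.

```python
# Minimal sketch: scan a Parquet dataset on S3 into Arrow memory.
import pyarrow.dataset as ds
import pyarrow.fs as fs

s3 = fs.S3FileSystem(region="us-east-1")          # region is a placeholder
dataset = ds.dataset("my-bucket/events/",          # bucket/path are placeholders
                     filesystem=s3, format="parquet")

# Read only the columns needed for the analysis
table = dataset.to_table(columns=["event_type", "timestamp"])
print(table.num_rows)
```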

Troubleshooting

Summary and Conclusion
