Unified Batch and Stream Processing with Apache Beam Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • Experience with Python programming.
  • Experience with the Linux command line.

Audience

  • Developers

Overview

Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.

In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.

By the end of this training, participants will be able to:

  • Install and configure Apache Beam.
  • Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
  • Execute pipelines across multiple environments.

Format of the Course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Note

  • This course will also be available in Scala in the future. Please contact us to arrange.

Course Outline

Introduction

  • Apache Beam vs MapReduce, Spark Streaming, Kafka Streams, Storm, and Flink

Installing and Configuring Apache Beam

Overview of Apache Beam Features and Architecture

  • Beam Model, SDKs, Beam Pipeline Runners
  • Distributed processing back-ends

Understanding the Apache Beam Programming Model

  • How a pipeline is executed

Running a sample pipeline

  • Preparing a WordCount pipeline
  • Executing the pipeline locally (see the sketch below)
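
As a minimal sketch of the kind of WordCount pipeline prepared in this module (the input path, output prefix, and class name are placeholders), a Beam Java version executed locally with the default DirectRunner might look like this:

    import java.util.Arrays;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class MinimalWordCount {
      public static void main(String[] args) {
        // With no --runner flag, Beam falls back to the DirectRunner and runs locally.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("input.txt"))        // placeholder input file
            .apply("SplitWords", FlatMapElements
                .into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
            .apply("DropEmpty", Filter.by((String word) -> !word.isEmpty()))
            .apply("CountWords", Count.perElement())
            .apply("FormatResults", MapElements
                .into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
            .apply("WriteCounts", TextIO.write().to("wordcounts")); // placeholder output prefix

        p.run().waitUntilFinish();
      }
    }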

Designing a Pipeline

  • Planning the structure, choosing the transforms, and determining the input and output methods

Creating the Pipeline

  • Writing the driver program and defining the pipeline
  • Using Apache Beam classes
  • Data sets, transforms, I/O, data encoding, etc. (see the sketch below)
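
As a small sketch of the Beam classes this module works with (the element values and the ToUpperFn transform are hypothetical), a driver program creates a Pipeline, builds PCollections, and applies PTransforms such as ParDo with a user-defined DoFn:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    public class DriverSketch {
      // A user-defined transform: DoFn describes per-element processing for ParDo.
      static class ToUpperFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(@Element String word, OutputReceiver<String> out) {
          out.output(word.toUpperCase());
        }
      }

      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // PCollection is Beam's distributed data set abstraction; Create.of builds one in memory.
        PCollection<String> words = pipeline.apply(Create.of("batch", "and", "stream"));
        words.apply(ParDo.of(new ToUpperFn()));

        pipeline.run().waitUntilFinish();
      }
    }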

Executing the Pipeline

  • Executing the pipeline locally, on remote machines, and on a public cloud
  • Choosing a runner
  • Runner-specific configurations (see the sketch below)
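
As a sketch of how the runner is typically chosen without changing the pipeline code (the project, region, and bucket names are placeholders), the runner and its specific options are passed on the command line and parsed into PipelineOptions:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerSelectionSketch {
      public static void main(String[] args) {
        // The same pipeline runs on different back-ends; the runner is picked via flags, e.g.:
        //   --runner=DirectRunner                                    (local execution)
        //   --runner=FlinkRunner --flinkMaster=<host:port>           (Apache Flink)
        //   --runner=DataflowRunner --project=<gcp-project> \
        //       --region=<region> --tempLocation=gs://<bucket>/tmp   (Google Cloud Dataflow)
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

        Pipeline pipeline = Pipeline.create(options);
        // ... apply transforms as usual ...
        pipeline.run().waitUntilFinish();
      }
    }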

Testing and Debugging Apache Beam

  • Using type hints to emulate static typing
  • Managing Python Pipeline Dependencies

Processing Bounded and Unbounded Datasets

  • Windowing and Triggers (see the sketch below)
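
A minimal sketch of windowing an unbounded data set in Beam Java, assuming a hypothetical PCollection of keyed events with event-time timestamps: elements are grouped into one-minute fixed windows, results fire when the watermark passes the end of the window, and late data is still accepted for five minutes.

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class WindowingSketch {
      // 'events' is a hypothetical unbounded PCollection of (key, value) records.
      static PCollection<KV<String, Long>> countPerMinute(PCollection<KV<String, String>> events) {
        return events
            .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1)))
                // Fire when the watermark passes the end of the window ...
                .triggering(AfterWatermark.pastEndOfWindow()
                    // ... and again for every late element that still arrives.
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(Duration.standardMinutes(5))
                .accumulatingFiredPanes())
            .apply(Count.perKey());
      }
    }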

Making Your Pipelines Reusable and Maintainable

Creating New Data Sources and Sinks

  • Apache Beam Source and Sink API (a simplified sink sketch follows)
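
The dedicated Source and Sink APIs (which add splitting, checkpointing, and dynamic work rebalancing) are covered in this module; as a simplified, hedged sketch, a basic custom sink can be written as a composite PTransform that wraps a ParDo writing each element. The ConsoleSink name and its stdout destination are placeholders for a real external system.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PDone;

    public class ConsoleSink extends PTransform<PCollection<String>, PDone> {
      // A static nested DoFn avoids capturing the enclosing transform when sent to workers.
      static class WriteFn extends DoFn<String, Void> {
        @ProcessElement
        public void processElement(@Element String element) {
          // Replace with a write to the real destination (database, queue, file store, ...).
          System.out.println(element);
        }
      }

      @Override
      public PDone expand(PCollection<String> input) {
        input.apply(ParDo.of(new WriteFn()));
        return PDone.in(input.getPipeline());
      }
    }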

Integrating Apache Beam with other Big Data Systems

  • Apache Hadoop, Apache Spark, Apache Kafka

Troubleshooting

Summary and Conclusion

Stream Processing with Kafka Streams Training Course

Duration

7 hours (usually 1 day including breaks)

Requirements

  • An understanding of Apache Kafka
  • Java programming experience

Overview

Kafka Streams is a client-side library for building applications and microservices whose data is passed to and from a Kafka messaging system. Traditionally, Apache Kafka has relied on Apache Spark or Apache Storm to process data between message producers and consumers. By calling the Kafka Streams API from within an application, data can be processed directly within Kafka, bypassing the need to send the data to a separate cluster for processing.

In this instructor-led, live training, participants will learn how to integrate Kafka Streams into a set of sample Java applications that pass data to and from Apache Kafka for stream processing.

By the end of this training, participants will be able to:

  • Understand Kafka Streams features and advantages over other stream processing frameworks
  • Process stream data directly within a Kafka cluster
  • Write a Java or Scala application or microservice that integrates with Kafka and Kafka Streams
  • Write concise code that transforms input Kafka topics into output Kafka topics
  • Build, package and deploy the application

Audience

  • Developers

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Notes

  • To request a customized training for this course, please contact us to arrange

Course Outline

Introduction

  • Kafka vs Spark, Flink, and Storm

Overview of Kafka Streams Features

  • Stateful and stateless processing, event-time processing, the Streams DSL, event-time windowing operations, etc. (see the sketch below)
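
A short sketch combining several of these features, assuming a hypothetical "clicks" input topic with string keys and values and default string serdes: a stateless filter followed by a stateful, event-time windowed count expressed in the Streams DSL. Exact windowing method names vary slightly across Kafka versions.

    import java.time.Duration;

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.kstream.Windowed;

    public class FeatureSketch {
      static void buildTopology(StreamsBuilder builder) {
        KStream<String, String> clicks = builder.stream("clicks");   // hypothetical input topic

        // Stateless step: filter; stateful step: per-key count over 5-minute event-time windows.
        KTable<Windowed<String>, Long> counts = clicks
            .filter((user, page) -> page != null)
            .groupByKey()
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))  // TimeWindows.of(...) on older clients
            .count();

        // 'counts' can then be written to an output topic or queried as a state store.
      }
    }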

Case Study: Kafka Streams API for Predictive Budgeting

Setting up the Development Environment

Creating a Streams Application
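
A minimal sketch of the application skeleton built in this module, assuming a local single-broker cluster and default string serdes; the application id is a placeholder:

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class StreamsAppSketch {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-streams-app");   // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // local Kafka cluster
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // ... define the processing topology on 'builder' ...

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Close the Streams client cleanly on shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
      }
    }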

Starting the Kafka Cluster

Preparing the Topics and Input Data

Options for Processing Stream Data

  • High-level Kafka Streams DSL
  • Lower-level Processor API

Transforming the Input Data
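
A brief sketch of a DSL transformation from an input topic to an output topic; the topic names are hypothetical, and default string serdes are assumed:

    import java.util.Arrays;

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;

    public class TransformSketch {
      static void buildTopology(StreamsBuilder builder) {
        // Read the input topic, split each line into words, and write them to the output topic.
        KStream<String, String> lines = builder.stream("streams-plaintext-input");
        lines.flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
             .to("streams-words-output");
      }
    }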

Inspecting the Output Data

Stopping the Kafka Cluster

Options for Deploying the Application

  • Classic ops tools (Puppet, Chef and Salt)
  • Docker
  • WAR file

Troubleshooting

Summary and Conclusion

Real-Time Stream Processing with MapR Training Course

Duration

7 hours (usually 1 day including breaks)

Requirements

  • An understanding of Big Data concepts
  • An understanding of Hadoop concepts
  • Java programming experience
  • Comfortable using a Linux command line

Overview

In this instructor-led, live training, participants will learn the core concepts behind the MapR Streams architecture as they develop a real-time streaming application.

By the end of this training, participants will be able to build producer and consumer applications for real-time stream data processing.

Audience

  • Developers
  • Administrators

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Note

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

Overview of MapR Streams Architecture

MapR Streams Core Components

Understanding How Messages Are Managed in MapR Streams

Understanding Producers and Consumers

Developing a MapR Streams Application

  • Streams, Producer, Consumer
  • Using the Kafka Java API (see the sketch after this list)
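
A hedged sketch of the producer and consumer pieces named above, written against the standard Kafka Java client API that MapR Streams exposes. The stream path, topic, and consumer group are hypothetical; exact configuration differs between MapR's client (which resolves the cluster from the stream path) and stock Apache Kafka (which also requires bootstrap.servers).

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MaprStreamsSketch {
      // In MapR Streams, topics are addressed as "/<stream path>:<topic>"; this path is hypothetical.
      static final String TOPIC = "/apps/sensor-stream:readings";

      static void produce() {
        Properties props = new Properties();
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          producer.send(new ProducerRecord<>(TOPIC, "sensor-1", "42.0"));
        }
      }

      static void consume() {
        Properties props = new Properties();
        props.put("group.id", "reading-consumers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          consumer.subscribe(Collections.singletonList(TOPIC));
          // poll(Duration) in recent clients; older client versions use poll(long).
          ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
          for (ConsumerRecord<String, String> record : records) {
            System.out.println(record.key() + " -> " + record.value());
          }
        }
      }
    }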

Working with Properties and Options

Summary and Conclusion