Hadoop and Spark for Administrators Training Course

Duration

35 hours (usually 5 days including breaks)

Requirements

  • System administration experience
  • Experience with Linux command line
  • An understanding of big data concepts

Audience

  • System administrators
  • DBAs

Overview

Apache Hadoop is a popular framework for processing large data sets across clusters of computers.

This instructor-led, live training (online or onsite) is aimed at system administrators who wish to learn how to set up, deploy and manage Hadoop clusters within their organization.

By the end of this training, participants will be able to:

  • Install and configure Apache Hadoop.
  • Understand the four major components in the Hadoop ecosystem: HDFS, MapReduce, YARN, and Hadoop Common.
  • Use Hadoop Distributed File System (HDFS) to scale a cluster to hundreds or thousands of nodes.  
  • Set up HDFS to operate as the storage engine for on-premises Spark deployments.
  • Set up Spark to access alternative storage solutions such as Amazon S3 and NoSQL databases such as Redis, Elasticsearch, Couchbase, and Aerospike (see the sketch after this list).
  • Carry out administrative tasks such as provisioning, management, monitoring and securing an Apache Hadoop cluster.
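
The following is a minimal PySpark sketch of the S3 setup mentioned above. It assumes the hadoop-aws (s3a) connector is on the classpath; the bucket name and credential placeholders are illustrative, not part of the course materials.

    # Point Spark at Amazon S3 via the s3a filesystem connector.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-storage-demo")
             .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
             .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder
             .getOrCreate())

    # Read and write directly against an S3 bucket using s3a:// paths.
    df = spark.read.json("s3a://example-bucket/events/")
    df.write.parquet("s3a://example-bucket/events-parquet/")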

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

  • Introduction to Cloud Computing and Big Data solutions
  • Overview of Apache Hadoop Features and Architecture

Setting up Hadoop

  • Planning a Hadoop cluster (on-premise, cloud, etc.)
  • Selecting the OS and Hadoop distribution
  • Provisioning resources (hardware, network, etc.)
  • Downloading and installing the software
  • Sizing the cluster for flexibility

Working with HDFS

  • Understanding the Hadoop Distributed File System (HDFS)
  • Overview of HDFS Command Reference
  • Accessing HDFS
  • Performing Basic File Operations on HDFS (see the sketch below)
  • Using S3 as a complement to HDFS
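
As a preview of the basic file operations above, here is a minimal sketch that drives the standard hdfs dfs command-line client from Python. It assumes a configured Hadoop client on the PATH; the paths are placeholders.

    import subprocess

    def hdfs(*args):
        """Run an 'hdfs dfs' subcommand, raising on failure."""
        subprocess.run(["hdfs", "dfs", *args], check=True)

    hdfs("-mkdir", "-p", "/user/demo")         # create a directory
    hdfs("-put", "report.csv", "/user/demo/")  # upload a local file
    hdfs("-ls", "/user/demo")                  # list directory contents
    hdfs("-cat", "/user/demo/report.csv")      # print a file
    hdfs("-rm", "-r", "/user/demo")            # remove recursively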

Overview of MapReduce

  • Understanding Data Flow in the MapReduce Framework
  • Map, Shuffle, Sort and Reduce
  • Demo: Computing Top Salaries (sketched below)
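
To make the data flow concrete, here is a toy, single-process Python illustration of the map, shuffle/sort, and reduce phases using the top-salaries demo; the records are made up, and a real job would run the same logic on a Hadoop cluster.

    from itertools import groupby
    from operator import itemgetter

    # Map phase: emit (key, value) pairs -- here (department, salary).
    records = [("sales", 70000), ("eng", 95000), ("sales", 82000), ("eng", 88000)]
    mapped = [(dept, salary) for dept, salary in records]

    # Shuffle/sort phase: bring all values for a key together.
    mapped.sort(key=itemgetter(0))
    grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}

    # Reduce phase: compute the top salary per department.
    top = {dept: max(salaries) for dept, salaries in grouped.items()}
    print(top)  # {'eng': 95000, 'sales': 82000}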

Working with YARN

  • Understanding resource management in Hadoop
  • Working with the ResourceManager, NodeManager, and ApplicationMaster
  • Scheduling jobs under YARN
  • Scheduling for large numbers of nodes and clusters
  • Demo: Job scheduling

Integrating Hadoop with Spark

  • Setting up storage for Spark (HDFS, Amazon S3, NoSQL, etc.)
  • Understanding Resilient Distributed Datasets (RDDs)
  • Creating an RDD
  • Implementing RDD Transformations
  • Demo: Implementing a Text Search Program for Movie Titles (sketched below)
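
A minimal PySpark sketch of the text-search demo: load titles into an RDD, apply a filter transformation, and trigger the job with an action. The input path and search term are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("title-search").getOrCreate()
    sc = spark.sparkContext

    titles = sc.textFile("hdfs:///data/movie_titles.txt")    # one title per line
    matches = titles.filter(lambda t: "star" in t.lower())   # lazy transformation
    print(matches.take(10))                                  # action runs the job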

Managing a Hadoop Cluster

  • Monitoring Hadoop
  • Securing a Hadoop cluster
  • Adding and removing nodes
  • Running a performance benchmark
  • Tuning a Hadoop cluster to optimize performance
  • Backup, recovery and business continuity planning
  • Ensuring high availability (HA)

Upgrading and Migrating a Hadoop Cluster

  • Assessing workload requirements
  • Upgrading Hadoop
  • Moving from on-premises to cloud and vice versa
  • Recovering from failures

Troubleshooting

Summary and Conclusion

Magellan: Geospatial Analytics on Spark Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • Experience with Apache Spark
  • Experience with SQL and data frame queries

Audience

  • Application developers

Overview

Magellan is an open-source distributed execution engine for geospatial analytics on big data. Implemented on top of Apache Spark, it extends Spark SQL and provides a relational abstraction for geospatial analytics.

This instructor-led, live training introduces the concepts and approaches for implementing geospatial analytics and walks participants through the creation of a predictive analysis application using Magellan on Spark.

By the end of this training, participants will be able to:

  • Efficiently query, parse and join geospatial datasets at scale
  • Implement geospatial data in business intelligence and predictive analytics applications
  • Use spatial context to extend the capabilities of mobile devices, sensors, logs, and wearables

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

To request a customized course outline for this training, please contact us to arrange.

Spark for Developers Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • Familiarity with Java, Scala, or Python (the labs are in Scala and Python)
  • A basic understanding of a Linux development environment (command-line navigation, editing files with vi or nano)

Overview

OBJECTIVE:

This course introduces Apache Spark. Participants will learn how Spark fits into the Big Data ecosystem and how to use it for data analysis. The course covers the Spark shell for interactive data analysis, Spark internals, the Spark APIs, Spark SQL, Spark Streaming, MLlib, and GraphX.

AUDIENCE:

Developers / Data Analysts

Course Outline

  1. Scala primer
    • A quick introduction to Scala
    • Labs: Getting to know Scala
  2. Spark Basics
    • Background and history
    • Spark and Hadoop
    • Spark concepts and architecture
    • Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
    • Labs: Installing and running Spark
  3. First Look at Spark
    • Running Spark in local mode
    • Spark web UI
    • Spark shell
    • Analyzing a dataset – part 1
    • Inspecting RDDs
    • Labs: Spark shell exploration
  4. RDDs
    • RDDs concepts
    • Partitions
    • RDD Operations / transformations
    • RDD types
    • Key-Value pair RDDs
    • MapReduce on RDD
    • Caching and persistence
    • Labs: Creating and inspecting RDDs; caching RDDs
  5. Spark API programming
    • Introduction to Spark API / RDD API
    • Submitting the first program to Spark
    • Debugging / logging
    • Configuration properties
    • Labs: Programming with the Spark API; submitting jobs
  6. Spark SQL
    • SQL support in Spark
    • Dataframes
    • Defining tables and importing datasets
    • Querying data frames using SQL
    • Storage formats: JSON / Parquet
    • Labs: Creating and querying data frames; evaluating data formats (see the sketch below)
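
A minimal sketch of the lab above, assuming a running SparkSession; the table and data are made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

    df.createOrReplaceTempView("people")          # expose the DataFrame as a SQL table
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    df.write.mode("overwrite").parquet("/tmp/people.parquet")  # columnar storage format
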
  7. MLlib
    • MLlib intro
    • MLlib algorithms
    • Labs: Writing MLlib applications
  8. GraphX
    • GraphX library overview
    • GraphX APIs
    • Labs: Processing graph data using Spark
  9. Spark Streaming
    • Streaming overview
    • Evaluating Streaming platforms
    • Streaming operations
    • Sliding window operations
    • Labs: Writing Spark Streaming applications (see the sketch below)
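
A minimal sketch of a streaming word count using Spark Structured Streaming, reading lines from a local socket (e.g. fed by "nc -lk 9999") and printing running counts to the console.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()
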
  10. Spark and Hadoop
    • Hadoop Intro (HDFS / YARN)
    • Hadoop + Spark architecture
    • Running Spark on Hadoop YARN
    • Processing HDFS files using Spark (see the sketch below)
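
A minimal sketch of pointing a Spark application at YARN; in practice the master is usually supplied by spark-submit (e.g. spark-submit --master yarn app.py), and the HDFS path is a placeholder.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("spark-on-yarn-demo")
             .master("yarn")                    # requires HADOOP_CONF_DIR to be set
             .getOrCreate())

    df = spark.read.text("hdfs:///data/logs/")  # read files stored in HDFS
    print(df.count())
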
  11. Spark Performance and Tuning
    • Broadcast variables
    • Accumulators (see the sketch after this list)
    • Memory management & caching
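
A minimal sketch of the two shared-variable types above: a broadcast variable (a read-only lookup shipped once to each executor) and an accumulator (a write-only counter aggregated back on the driver). The lookup data is made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tuning-demo").getOrCreate()
    sc = spark.sparkContext

    lookup = sc.broadcast({"US": "United States", "DE": "Germany"})
    bad_codes = sc.accumulator(0)

    def expand(code):
        if code not in lookup.value:
            bad_codes.add(1)       # counted across all tasks
            return "unknown"
        return lookup.value[code]

    result = sc.parallelize(["US", "DE", "XX"]).map(expand).collect()
    print(result, "unmatched:", bad_codes.value)
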
  12. Spark Operations
    • Deploying Spark in production
    • Sample deployment templates
    • Configurations
    • Monitoring
    • Troubleshooting

Python and Spark for Big Data (PySpark) Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • General programming skills

Audience

  • Developers
  • IT Professionals
  • Data Scientists

Overview

Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python.

In this instructor-led, live training, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.

By the end of this training, participants will be able to:

  • Use Spark with Python to analyze big data.
  • Work on exercises that mimic real-world cases.
  • Apply different tools and techniques for big data analysis using PySpark.

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

Introduction

Understanding Big Data

Overview of Spark

Overview of Python

Overview of PySpark

  • Distributing Data Using Resilient Distributed Datasets Framework
  • Distributing Computation Using Spark API Operators (see the sketch below)
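
A minimal sketch of both ideas, assuming a local PySpark installation: parallelize() distributes the data across partitions, and map()/reduce() distribute the computation.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyspark-overview").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize(range(1, 1001), numSlices=8)            # an RDD in 8 partitions
    total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)  # runs in parallel
    print(total)  # sum of squares from 1 to 1000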

Setting Up Python with Spark

Setting Up PySpark

Using Amazon Web Services (AWS) EC2 Instances for Spark

Setting Up Databricks

Setting Up the AWS EMR Cluster

Learning the Basics of Python Programming

  • Getting Started with Python
  • Using the Jupyter Notebook
  • Using Variables and Simple Data Types
  • Working with Lists
  • Using if Statements
  • Using User Inputs
  • Working with while Loops
  • Implementing Functions
  • Working with Classes
  • Working with Files and Exceptions
  • Working with Projects, Data, and APIs (a compact example of these basics follows)
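
A compact example touching several of the basics listed above (variables, lists, if statements, while loops, functions, and classes):

    class Counter:
        def __init__(self):
            self.count = 0

        def bump(self):
            self.count += 1

    def evens(values):
        return [v for v in values if v % 2 == 0]

    numbers = [1, 2, 3, 4, 5]
    counter = Counter()
    i = 0
    while i < len(numbers):
        if numbers[i] % 2 == 0:
            counter.bump()
        i += 1

    print(evens(numbers), counter.count)  # [2, 4] 2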

Learning the Basics of Spark DataFrame

  • Getting Started with Spark DataFrames
  • Implementing Basic Operations with Spark
  • Using Groupby and Aggregate Operations
  • Working with Timestamps and Dates

Working on a Spark DataFrame Project Exercise

Understanding Machine Learning with MLlib

Working with MLlib, Spark, and Python for Machine Learning

Understanding Regressions

  • Learning Linear Regression Theory
  • Implementing a Regression Evaluation Code
  • Working on a Sample Linear Regression Exercise (see the sketch after this list)
  • Learning Logistic Regression Theory
  • Implementing a Logistic Regression Code
  • Working on a Sample Logistic Regression Exercise
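
A minimal sketch of the linear regression workflow, using the DataFrame-based pyspark.ml API; the feature columns and data are made up.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("regression-demo").getOrCreate()

    df = spark.createDataFrame(
        [(1.0, 2.0, 5.1), (2.0, 1.0, 6.9), (3.0, 4.0, 13.2)],
        ["x1", "x2", "label"])

    # Assemble raw columns into the single vector column MLlib expects.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    model = LinearRegression(featuresCol="features", labelCol="label") \
        .fit(assembler.transform(df))

    print(model.coefficients, model.intercept)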

Understanding Random Forests and Decision Trees

  • Learning Tree Methods Theory
  • Implementing Decision Trees and Random Forest Codes
  • Working on a Sample Random Forest Classification Exercise

Working with K-means Clustering

  • Understanding K-means Clustering Theory
  • Implementing a K-means Clustering Code (see the sketch below)
  • Working on a Sample Clustering Exercise
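
A minimal sketch of K-means with pyspark.ml; the points and the choice of k=2 are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

    df = spark.createDataFrame(
        [(0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)], ["x", "y"])
    assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")

    model = KMeans(k=2, seed=42).fit(assembler.transform(df))
    print(model.clusterCenters())  # two centers, one per cluster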

Working with Recommender Systems
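
As a taste of this module, here is a minimal sketch of a collaborative-filtering recommender built with Spark's ALS implementation; the rating triples are made up.

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("recommender-demo").getOrCreate()

    ratings = spark.createDataFrame(
        [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 3.0)],
        ["userId", "movieId", "rating"])

    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              rank=5, maxIter=5, coldStartStrategy="drop")
    model = als.fit(ratings)

    model.recommendForAllUsers(2).show(truncate=False)  # top 2 movies per user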

Implementing Natural Language Processing

  • Understanding Natural Language Processing (NLP)
  • Overview of NLP Tools
  • Working on a Sample NLP Exercise (see the sketch below)
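
A minimal sketch of the kind of NLP preprocessing covered here: tokenizing text and hashing tokens into term-frequency feature vectors. The documents are made up.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, HashingTF

    spark = SparkSession.builder.appName("nlp-demo").getOrCreate()

    docs = spark.createDataFrame(
        [(0, "spark makes big data simple"), (1, "python and spark for nlp")],
        ["id", "text"])

    tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
    tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1024)
    tf.transform(tokens).select("id", "features").show(truncate=False)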

Streaming with Spark on Python

  • Overview of Streaming with Spark
  • Sample Spark Streaming Exercise

Closing Remarks

Python, Spark, and Hadoop for Big Data Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • Experience with Spark and Hadoop
  • Python programming experience

Audience

  • Data scientists
  • Developers

Overview

Python is a scalable, flexible, and widely used programming language for data science and machine learning. Spark is a data processing engine used in querying, analyzing, and transforming big data, while Hadoop is a software library framework for large-scale data storage and processing.

This instructor-led, live training (online or onsite) is aimed at developers who wish to use and integrate Spark, Hadoop, and Python to process, analyze, and transform large and complex data sets.

By the end of this training, participants will be able to:

  • Set up the necessary environment to start processing big data with Spark, Hadoop, and Python.
  • Understand the features, core components, and architecture of Spark and Hadoop.
  • Integrate Spark, Hadoop, and Python for big data processing.
  • Explore the tools in the Spark ecosystem (Spark MLlib, Spark Streaming, Kafka, Sqoop, and Flume).
  • Build collaborative filtering recommendation systems similar to Netflix, YouTube, Amazon, Spotify, and Google.
  • Use Apache Mahout to scale machine learning algorithms.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

  • Overview of Spark and Hadoop features and architecture
  • Understanding big data
  • Python programming basics

Getting Started

  • Setting up Python, Spark, and Hadoop
  • Understanding data structures in Python
  • Understanding PySpark API
  • Understanding HDFS and MapReduce

Integrating Spark and Hadoop with Python

  • Implementing Spark RDD in Python
  • Processing data using MapReduce
  • Creating distributed datasets in HDFS (see the sketch below)
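
A minimal sketch tying these three bullets together: an RDD created from a file in HDFS, processed MapReduce-style with map and reduceByKey, and written back to HDFS. The paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-wordcount").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs:///data/input.txt")        # distributed dataset in HDFS
    counts = (lines.flatMap(lambda line: line.split())   # map: split lines into words
              .map(lambda word: (word, 1))               # map: emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))          # reduce: sum counts per word
    counts.saveAsTextFile("hdfs:///data/output")         # write results back to HDFS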

Machine Learning with Spark MLlib

Processing Big Data with Spark Streaming

Working with Recommender Systems

Working with Kafka, Sqoop, and Flume
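
As a preview of the Kafka portion, here is a minimal sketch of consuming a Kafka topic with Spark Structured Streaming. It assumes the spark-sql-kafka connector package is available; the broker address and topic name are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())

    # Kafka delivers binary key/value columns; cast the payload to a string.
    messages = stream.selectExpr("CAST(value AS STRING) AS payload")
    query = messages.writeStream.format("console").start()
    query.awaitTermination()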

Apache Mahout with Spark and Hadoop

Troubleshooting

Summary and Next Steps