Duration
35 hours (usually 5 days including breaks)
Requirements
Knowledge of one of the following:
- Java
- Scala
- Python
- SparkR
Overview
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It provides common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It is divided into two packages:
- spark.mllib contains the original API built on top of RDDs.
- spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
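As a quick illustration of the spark.ml style described above, the sketch below chains two transformers and an estimator into a Pipeline over a DataFrame. It is a minimal Scala sketch; the sample data, column names, and parameter values are illustrative assumptions, not part of the course material.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PipelineSketch").getOrCreate()

    // Hypothetical training data: (id, text, label)
    val training = spark.createDataFrame(Seq(
      (0L, "spark makes big data simple", 1.0),
      (1L, "an unrelated example sentence", 0.0)
    )).toDF("id", "text", "label")

    // spark.ml style: transformers and an estimator chained into a Pipeline
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Fitting produces a PipelineModel that can transform new DataFrames
    val model = pipeline.fit(training)
    model.transform(training).select("id", "probability", "prediction").show()

    spark.stop()
  }
}
```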
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark.
Course Outline
spark.mllib: data types, algorithms, and utilities
- Data types
- Basic statistics
- summary statistics
- correlations
- stratified sampling
- hypothesis testing
- streaming significance testing
- random data generation
- Classification and regression
- linear models (SVMs, logistic regression, linear regression)
- naive Bayes
- decision trees
- ensembles of trees (Random Forests and Gradient-Boosted Trees)
- isotonic regression
- Collaborative filtering
- alternating least squares (ALS)
- Clustering
- k-means
- Gaussian mixture
- power iteration clustering (PIC)
- latent Dirichlet allocation (LDA)
- bisecting k-means
- streaming k-means
- Dimensionality reduction
- singular value decomposition (SVD)
- principal component analysis (PCA)
- Feature extraction and transformation
- Frequent pattern mining
- FP-growth
- association rules
- PrefixSpan
- Evaluation metrics
- PMML model export
- Optimization (developer)
- stochastic gradient descent
- limited-memory BFGS (L-BFGS)
spark.ml: high-level APIs for ML pipelines
- Overview: estimators, transformers and pipelines
- Extracting, transforming and selecting features
- Classification and regression
- Clustering
- Advanced topics
Duration
21 hours (usually 3 days including breaks)
Requirements
- .NET programming experience using C# or F#
Overview
Apache Spark is a distributed processing engine for analyzing very large data sets. It can process data in batches and real-time, as well as carry out machine learning, ad-hoc queries, and graph processing. .NET for Apache Spark is a free, open-source, and cross-platform big data analytics framework that supports applications written in C# or F#.
This instructor-led, live training (online or onsite) is aimed at developers who wish to carry out big data analysis using Apache Spark in their .NET applications.
By the end of this training, participants will be able to:
- Install and configure Apache Spark.
- Understand how .NET implements Spark APIs so that they can be accessed from a .NET application.
- Develop data processing applications using C# or F#, capable of handling data sets whose size is measured in terabytes and petabytes.
- Develop machine learning features for a .NET application using Apache Spark capabilities.
- Carry out exploratory analysis using SQL queries on big data sets.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
Overview of Apache Spark Features and Architecture
- Apache Spark modules: Spark SQL, Spark Streaming, MLlib, GraphX
- RDD, DataFrames, driver and workers, DAG, etc.
Setting up Apache Spark on .NET
- Preparing the Java VM
- Running .NET for Apache Spark using .NET Core
Getting Started
- Creating a sample .NET console application
- Adding the Spark driver
- Initializing a SparkSession
- Executing the application
Preparing Data
- Building a data preparation pipeline
- Performing ETL (Extract, Transform, and Load)
Machine Learning
- Building a machine learning model
- Preparing the data
- Training a model
Real-time Processing
- Processing streaming data in real time
- Case study: monitoring sensor data
Interactive Query
- Working with Spark SQL
- Analyzing structured data
Visualizing Results
- Plotting results
- Using third-party tools to visualize results
Troubleshooting
Summary and Conclusion
Duration
21 hours (usually 3 days including breaks)
Requirements
- Experience with the Linux command line
- A general understanding of data processing
- Programming experience with Java, Scala, Python, or R
Overview
Apache Spark is an analytics engine designed to distribute data across a cluster in order to process it in parallel. It contains modules for streaming, SQL, machine learning and graph processing.
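To make the idea of cluster-parallel processing concrete, here is a minimal Scala sketch; the file path and column names are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ParallelReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ParallelReadSketch").getOrCreate()

    // Hypothetical large CSV data set; Spark splits it into partitions
    // that executor tasks process in parallel across the cluster
    val events = spark.read
      .option("header", "true")
      .csv("/data/events/*.csv")

    println(s"Partitions: ${events.rdd.getNumPartitions}")

    // Each partition is aggregated in parallel, then the partial results are combined
    events.groupBy("event_type").count().show()

    spark.stop()
  }
}
```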
This instructor-led, live training (online or onsite) is aimed at engineers who wish to deploy an Apache Spark system for processing very large amounts of data.
By the end of this training, participants will be able to:
- Install and configure Apache Spark.
- Understand the difference between Apache Spark and Hadoop MapReduce and when to use which.
- Quickly read in and analyze very large data sets.
- Integrate Apache Spark with other machine learning tools.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
- Apache Spark vs Hadoop MapReduce
Overview of Apache Spark Features and Architecture
Choosing a Programming Language
Setting up Apache Spark
Creating a Sample Application
Choosing the Data Set
Running Data Analysis on the Data
Processing Structured Data with Spark SQL
Processing Streaming Data with Spark Streaming
Integrating Apache Spark with 3rd Party Machine Learning Tools
Using Apache Spark for Graph Processing
Optimizing Apache Spark
Troubleshooting
Summary and Conclusion
Duration
21 hours (usually 3 days including breaks)
Requirements
- Programming skills (preferably Python or Scala)
- SQL basics
Overview
Apache Spark’s learning curve increases slowly at the beginning; it takes a lot of effort to get the first results. This course aims to jump through that first tough part. After taking this course, participants will understand the basics of Apache Spark, clearly differentiate RDDs from DataFrames, learn the Python and Scala APIs, understand executors and tasks, etc. Following best practices, this course also strongly focuses on cloud deployment with Databricks and AWS. The students will also understand the differences between AWS EMR and AWS Glue, one of the latest Spark services from AWS.
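For a first impression of the RDD vs DataFrame distinction mentioned above, here is a minimal Scala sketch performing the same aggregation with both APIs; the input records and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object RddVsDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RddVsDataFrameSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one log line per record
    val lines = Seq("INFO start", "ERROR disk full", "INFO done", "ERROR timeout")

    // RDD API: low-level, functional transformations on raw objects
    val rddCounts = spark.sparkContext
      .parallelize(lines)
      .map(line => (line.split(" ")(0), 1))
      .reduceByKey(_ + _)
    rddCounts.collect().foreach(println)

    // DataFrame API: declarative, optimized by the Catalyst engine
    lines.toDF("line")
      .withColumn("level", split($"line", " ")(0))
      .groupBy("level")
      .count()
      .show()

    spark.stop()
  }
}
```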
Audience
Data Engineers, DevOps Engineers, Data Scientists
Course Outline
Introduction:
- Apache Spark in Hadoop Ecosystem
- Short introduction to Python and Scala
Basics (theory):
- Architecture
- RDD
- Transformation and Actions
- Stage, Task, Dependencies
Using the Databricks environment to understand the basics (hands-on workshop):
- Exercises using RDD API
- Basic action and transformation functions
- PairRDD
- Join
- Caching strategies
- Exercises using DataFrame API
- SparkSQL
- DataFrame: select, filter, group, sort
- UDF (User Defined Function)
- Looking into the Dataset API
- Streaming
Using the AWS environment to understand deployment (hands-on workshop):
- Basics of AWS Glue
- Understand the differences between AWS EMR and AWS Glue
- Example jobs in both environments
- Understand pros and cons
Extra:
- Introduction to Apache Airflow orchestration
Duration
7 hours (usually 1 day including breaks)
Requirements
- Experience with SQL queries
- Programming experience in any language
Audience
- Data analysts
- Data scientists
- Data engineers
Overview
Spark SQL is Apache Spark’s module for working with structured and semi-structured data. Spark SQL provides information about the structure of the data as well as the computation being performed. This information can be used to perform optimizations. Two common uses for Spark SQL are:
- to execute SQL queries.
- to read data from an existing Hive installation.
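The hedged Scala sketch below shows both uses in miniature: it registers a DataFrame as a view and queries it with plain SQL. The JSON path, column names, and Hive support are assumptions for illustration (enableHiveSupport requires Hive dependencies and an existing metastore).

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark read tables from an existing Hive installation
    val spark = SparkSession.builder
      .appName("SparkSqlSketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical JSON data set; path and schema are illustrative only
    val orders = spark.read.json("/data/orders.json")
    orders.createOrReplaceTempView("orders")

    // Plain SQL over the registered view
    spark.sql(
      """SELECT customer_id, SUM(amount) AS total
        |FROM orders
        |GROUP BY customer_id
        |ORDER BY total DESC""".stripMargin
    ).show()

    spark.stop()
  }
}
```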
In this instructor-led, live training (onsite or remote), participants will learn how to analyze various types of data sets using Spark SQL.
By the end of this training, participants will be able to:
- Install and configure Spark SQL.
- Perform data analysis using Spark SQL.
- Query data sets in different formats.
- Visualize data and query results.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
Overview of Data Access Approaches (Hive, databases, etc.)
Overview of Spark Features and Architecture
Installing and Configuring Spark
Understanding DataFrames in Spark
Defining Tables and Importing Datasets
Querying DataFrames using SQL
Carrying out Aggregations, JOINs and Nested Queries
Uploading and Accessing Data
Querying Different Types of Data
Querying Data Lakes with SQL
Troubleshooting
Summary and Conclusion
Duration
14 hours (usually 2 days including breaks)
Overview
The course is part of the Data Scientist skill set (Domain: Data and Technology).
Course Outline
Data Warehousing Concepts
- What is a Data Warehouse?
- Difference between OLTP and Data Warehousing
- Data Acquisition
- Data Extraction
- Data Transformation
- Data Loading
- Data Marts
- Dependent vs. Independent Data Marts
- Database Design
ETL Testing Concepts
- Introduction
- Software development life cycle
- Testing methodologies
- ETL testing workflow process
- ETL testing responsibilities in DataStage
Big Data Fundamentals
- Big Data and its role in the corporate world
- The phases of development of a Big Data strategy within a corporation
- The rationale underlying a holistic approach to Big Data
- Components needed in a Big Data platform
- Big Data storage solutions
- Limits of traditional technologies
- Overview of database types
NoSQL Databases
Hadoop
MapReduce
Apache Spark
Duration
14 hours (usually 2 days including breaks)
Requirements
Knowledge of the Java or Scala programming language. Basic familiarity with statistics and linear algebra is recommended.
Overview
The aim of this course is to provide basic proficiency in applying Machine Learning methods in practice. Through the use of the Scala programming language and its various libraries, and based on a multitude of practical examples, this course teaches how to use the most important building blocks of Machine Learning, how to make data modeling decisions, how to interpret the outputs of the algorithms, and how to validate the results.
Our goal is to give you the skills to understand and use the most fundamental tools from the Machine Learning toolbox confidently and to avoid the common pitfalls of Data Science applications.
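As one possible illustration with Scala libraries (here Spark's spark.ml package, one of several options the course may use), the hedged sketch below fits a linear regression and validates it on held-out data; the data set, column names, and parameter values are invented for illustration.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object LinearRegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LinearRegressionSketch").getOrCreate()

    // Tiny hypothetical data set: two numeric features and a numeric label
    val data = spark.createDataFrame(Seq(
      (1.0, 2.0, 5.1), (2.0, 1.0, 4.0), (3.0, 4.0, 11.2),
      (4.0, 3.0, 9.9), (5.0, 5.0, 15.3), (6.0, 2.0, 9.8)
    )).toDF("x1", "x2", "label")

    // Assemble raw columns into the feature vector expected by the estimator
    val assembler = new VectorAssembler().setInputCols(Array("x1", "x2")).setOutputCol("features")
    val Array(train, test) = assembler.transform(data).randomSplit(Array(0.75, 0.25), seed = 42)

    // Fit the model on the training split and validate it on the held-out split
    val model = new LinearRegression().fit(train)
    val rmse = new RegressionEvaluator().setMetricName("rmse").evaluate(model.transform(test))
    println(s"Test RMSE: $rmse")

    spark.stop()
  }
}
```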
Course Outline
Introduction to Applied Machine Learning
- Statistical learning vs. Machine learning
- Iteration and evaluation
- Bias-Variance trade-off
Machine Learning with Scala
- Choice of libraries
- Add-on tools
Regression
- Linear regression
- Generalizations and Nonlinearity
- Exercises
Classification
- Bayesian refresher
- Naive Bayes
- Logistic regression
- K-Nearest neighbors
- Exercises
Cross-validation and Resampling
- Cross-validation approaches
- Bootstrap
- Exercises
Unsupervised Learning
- K-means clustering
- Examples
- Challenges of unsupervised learning and beyond K-means