Databricks Training Course

Introduction

  • Overview of Databricks and Apache Spark
  • Understanding the Databricks architecture

Getting Started

  • Setting up the Environment
  • Setting up and configuring Databricks
  • Navigating the Databricks user interface
  • Creating a Databricks workspace

Working with Data in Databricks

  • Connecting to an Apache Spark data source
  • Understanding the basics of columns and data types
  • Managing the file system from notebooks (see the sketch after this list)
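
A hedged sketch of the notebook work this module covers, assuming a Databricks notebook where `spark` and `dbutils` are predefined; the dataset path is illustrative only:

```python
# Browse the Databricks File System (DBFS) from a notebook.
for f in dbutils.fs.ls("/databricks-datasets")[:5]:
    print(f.path)

# Read a CSV file (hypothetical path) and inspect its columns and data types.
df = spark.read.csv("/databricks-datasets/samples/some_file.csv",
                    header=True, inferSchema=True)
df.printSchema()
df.show(5)
```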

Managing Jobs and Clusters

  • Creating and configuring clusters
  • Creating jobs using notebooks
  • Running jobs
  • Viewing jobs and job details

Using Delta Lake in Databricks

  • Loading data into Delta Lake (see the sketch after this list)
  • Managing data in Delta Lake
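
A hedged sketch of both items, assuming a Databricks runtime (or a Spark session with the delta-spark package configured); the table path is hypothetical:

```python
from delta.tables import DeltaTable

# Load data into Delta Lake: write a DataFrame in Delta format.
df = spark.range(0, 100).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Manage data in Delta Lake: read it back, delete transactionally, audit history.
events = spark.read.format("delta").load("/tmp/delta/events")
events.show(5)

dt = DeltaTable.forPath(spark, "/tmp/delta/events")
dt.delete("event_id < 10")   # ACID delete
dt.history().show()          # transaction log / time-travel metadata
```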

Securing Databricks

  • Managing Databricks security
  • Managing backup and recovery

Troubleshooting

Apache Spark MLlib Training Course

Duration

35 hours (usually 5 days including breaks)

Requirements

Knowledge of one of the following:

  • Java
  • Scala
  • Python
  • SparkR

Overview

MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.

It is divided into two packages:

  • spark.mllib contains the original API, built on top of RDDs.
  • spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines (see the sketch below).
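
To make the split concrete, here is a minimal, hedged sketch of a spark.ml pipeline in Python: a transformer (VectorAssembler) and an estimator (LogisticRegression) chained together and fit on a toy DataFrame. Column names and data are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy DataFrame: a binary label and two numeric features (illustrative data).
df = spark.createDataFrame(
    [(1.0, 0.0, 1.1), (0.0, 2.0, 1.0), (1.0, 0.5, 1.3), (0.0, 2.2, -1.0)],
    ["label", "f1", "f2"],
)

# Transformer + estimator chained into a single Pipeline.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(df)                      # fit() returns a PipelineModel
model.transform(df).select("label", "prediction").show()
```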

Audience

This course is directed at engineers and developers seeking to utilize the machine learning library built into Apache Spark.

Course Outline

spark.mllib: data types, algorithms, and utilities

  • Data types
  • Basic statistics
    • summary statistics
    • correlations
    • stratified sampling
    • hypothesis testing
    • streaming significance testing
    • random data generation
  • Classification and regression
    • linear models (SVMs, logistic regression, linear regression)
    • naive Bayes
    • decision trees
    • ensembles of trees (Random Forests and Gradient-Boosted Trees)
    • isotonic regression
  • Collaborative filtering
    • alternating least squares (ALS)
  • Clustering
    • k-means
    • Gaussian mixture
    • power iteration clustering (PIC)
    • latent Dirichlet allocation (LDA)
    • bisecting k-means
    • streaming k-means
  • Dimensionality reduction
    • singular value decomposition (SVD)
    • principal component analysis (PCA)
  • Feature extraction and transformation
  • Frequent pattern mining
    • FP-growth
    • association rules
    • PrefixSpan
  • Evaluation metrics
  • PMML model export
  • Optimization (developer)
    • stochastic gradient descent
    • limited-memory BFGS (L-BFGS)

spark.ml: high-level APIs for ML pipelines

  • Overview: estimators, transformers and pipelines
  • Extracting, transforming and selecting features
  • Classification and regression
  • Clustering
  • Advanced topics

Apache Spark for .NET Developers Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • .NET programming experience using C# or F#

Audience

  • Developers

Overview

Apache Spark is a distributed processing engine for analyzing very large data sets. It can process data in batches and in real time, as well as carry out machine learning, ad hoc queries, and graph processing. .NET for Apache Spark is a free, open-source, and cross-platform big data analytics framework that supports applications written in C# or F#.

This instructor-led, live training (online or onsite) is aimed at developers who wish to carry out big data analysis using Apache Spark in their .NET applications.

By the end of this training, participants will be able to:

  • Install and configure Apache Spark.
  • Understand how .NET implements Spark APIs so that they can be accessed from a .NET application.
  • Develop data processing applications using C# or F#, capable of handling data sets whose size is measured in terabytes and petabytes.
  • Develop machine learning features for a .NET application using Apache Spark capabilities.
  • Carry out exploratory analysis using SQL queries on big data sets.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

Overview of Apache Spark Features and Architecture

  • Apache Spark modules: Spark SQL, Spark Streaming, MLlib, GraphX
  • RDD, DataFrames, driver/workers, DAG, etc.

Setting up Apache Spark on .NET

  • Preparing the Java VM
  • Running .NET for Apache Spark using .NET Core

Getting Started

  • Creating a sample .NET console application
  • Adding the Spark driver
  • Initializing a SparkSession
  • Executing the application

Preparing Data

  • Building a data preparation pipeline
  • Performing ETL (Extract, Transform, and Load)

Machine Learning

  • Building a machine learning model
  • Preparing the data
  • Training a model

Real-time Processing

  • Processing streaming data in real time
  • Case study: monitoring sensor data

Interactive Query

  • Working with Spark SQL
  • Analyzing structured data

Visualizing Results

  • Plotting results
  • Using third-party tools to visualize results

Troubleshooting

Summary and Conclusion

Apache Spark Fundamentals Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • Experience with the Linux command line
  • A general understanding of data processing
  • Programming experience with Java, Scala, Python, or R

Audience

  • Developers

Overview

Apache Spark is an analytics engine designed to distribute data across a cluster in order to process it in parallel. It contains modules for streaming, SQL, machine learning and graph processing.
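
As a minimal sketch of that idea (assuming a local Spark installation and PySpark), the snippet below partitions a data set and lets Spark process the partitions in parallel:

```python
from pyspark.sql import SparkSession

# local[*] uses all local cores; on a cluster the same code runs across executors.
spark = SparkSession.builder.master("local[*]").appName("parallel-demo").getOrCreate()

# Distribute one million numbers over 8 partitions; map() and sum() run in parallel.
nums = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
print(nums.map(lambda x: x * 2).sum())
```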

This instructor-led, live training (online or onsite) is aimed at engineers who wish to deploy an Apache Spark system for processing very large amounts of data.

By the end of this training, participants will be able to:

  • Install and configure Apache Spark.
  • Understand the difference between Apache Spark and Hadoop MapReduce and when to use which.
  • Quickly read in and analyze very large data sets.
  • Integrate Apache Spark with other machine learning tools.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

  • Apache Spark vs Hadoop MapReduce

Overview of Apache Spark Features and Architecture

Choosing a Programming Language

Setting up Apache Spark

Creating a Sample Application

Choosing the Data Set

Running Data Analysis on the Data

Processing of Structured Data with Spark SQL

Processing Streaming Data with Spark Streaming

Integrating Apache Spark with 3rd Party Machine Learning Tools

Using Apache Spark for Graph Processing

Optimizing Apache Spark

Troubleshooting

Summary and Conclusion

Apache Spark in the Cloud Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

Programming skills (preferably Python or Scala)

SQL basics

Overview

Apache Spark’s learning curve rises slowly at the beginning; it takes a lot of effort to get the first return. This course aims to jump through that first tough part. After taking this course, participants will understand the basics of Apache Spark, clearly differentiate an RDD from a DataFrame, learn the Python and Scala APIs, and understand executors and tasks. Following best practices, the course also focuses strongly on cloud deployment with Databricks and AWS. Participants will learn the differences between AWS EMR and AWS Glue, the latter being one of the newest Spark services from AWS.
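
As a hedged illustration of the RDD/DataFrame distinction the course draws (assuming a SparkSession named spark, as predefined in Databricks notebooks), the same word count looks like this in both APIs:

```python
from pyspark.sql import functions as F

words = ["spark", "rdd", "spark", "dataframe"]

# RDD API: untyped functions; the optimizer is not involved.
counts_rdd = (spark.sparkContext.parallelize(words)
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b))
print(counts_rdd.collect())

# DataFrame API: declarative operations, optimized by Catalyst.
df = spark.createDataFrame([(w,) for w in words], ["word"])
df.groupBy("word").agg(F.count("*").alias("n")).show()
```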

Audience

  • Data engineers
  • DevOps engineers
  • Data scientists

Course Outline

Introduction:

  • Apache Spark in the Hadoop Ecosystem
  • Short intro to Python and Scala

Basics (theory):

  • Architecture
  • RDD
  • Transformations and Actions (see the sketch after this list)
  • Stage, Task, Dependencies
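
A minimal sketch of the transformation/action distinction, assuming a SparkSession named spark (predefined in Databricks notebooks):

```python
rdd = spark.sparkContext.parallelize(range(10))

squared = rdd.map(lambda x: x * x)            # transformation: lazy, builds the DAG
evens = squared.filter(lambda x: x % 2 == 0)  # another lazy transformation

print(evens.count())    # action: triggers execution of the whole lineage
print(evens.collect())  # action: materializes results on the driver
```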

Using the Databricks environment to understand the basics (hands-on workshop):

  • Exercises using the RDD API
  • Basic action and transformation functions
  • PairRDD
  • Join
  • Caching strategies
  • Exercises using the DataFrame API (see the sketch after this list)
  • SparkSQL
  • DataFrame: select, filter, group, sort
  • UDF (User Defined Function)
  • Looking into the Dataset API
  • Streaming
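
A hedged sketch of the DataFrame exercises listed above (column names and data are illustrative; spark is the notebook's SparkSession):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

sales = spark.createDataFrame(
    [("alice", "EU", 120.0), ("bob", "US", 80.0), ("carol", "EU", 200.0)],
    ["name", "region", "amount"],
)

# select / filter / group / sort in one chain.
(sales
    .select("region", "amount")
    .filter(F.col("amount") > 50)
    .groupBy("region")
    .agg(F.sum("amount").alias("total"))
    .orderBy(F.desc("total"))
    .show())

# A Python UDF; opaque to the optimizer, so prefer built-in functions when possible.
label = F.udf(lambda amt: "big" if amt > 100 else "small", StringType())
sales.withColumn("size", label("amount")).show()
```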

Using the AWS environment to understand deployment (hands-on workshop):

  • Basics of AWS Glue
  • Understand the differences between AWS EMR and AWS Glue
  • Example jobs in both environments
  • Understand pros and cons

Extra:

  • Introduction to Apache Airflow orchestration

Apache Spark SQL Training Course

Duration

7 hours (usually 1 day including breaks)

Requirements

  • Experience with SQL queries
  • Programming experience in any language

Audience

  • Data analysts
  • Data scientists
  • Data engineers

Overview

Spark SQL is Apache Spark’s module for working with structured and semi-structured data. Spark SQL provides information about the structure of the data as well as the computation being performed. This information can be used to perform optimizations. Two common uses for Spark SQL are:

  • executing SQL queries
  • reading data from an existing Hive installation
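
A minimal sketch of the first use, in PySpark (table and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

people = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)], ["name", "age"]
)
people.createOrReplaceTempView("people")   # expose the DataFrame to SQL

spark.sql("SELECT name, age FROM people WHERE age > 30 ORDER BY age").show()
```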

In this instructor-led, live training (onsite or remote), participants will learn how to analyze various types of data sets using Spark SQL.

By the end of this training, participants will be able to:

  • Install and configure Spark SQL.
  • Perform data analysis using Spark SQL.
  • Query data sets in different formats.
  • Visualize data and query results.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

Overview of Data Access Approaches (Hive, databases, etc.)

Overview of Spark Features and Architecture

Installing and Configuring Spark

Understanding DataFrames in Spark

Defining Tables and Importing Datasets

Querying DataFrames using SQL

Carrying out Aggregations, JOINs and Nested Queries

Uploading and Accessing Data

Querying Different Types of Data

  • JSON, Parquet, etc. (see the sketch below)
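
A hedged sketch of querying those formats (file paths are hypothetical; spark is an existing SparkSession):

```python
# The same DataFrame/SQL interface works across storage formats.
events_json = spark.read.json("/data/events.json")           # schema inferred from JSON
events_parquet = spark.read.parquet("/data/events.parquet")  # schema stored in Parquet

events_json.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()
```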

Querying Data Lakes with SQL

Troubleshooting

Summary and Conclusion

Big Data & Database Systems Fundamentals Training Course

Duration

14 hours (usually 2 days including breaks)

Overview

The course is part of the Data Scientist skill set (Domain: Data and Technology).

Course Outline

Data Warehousing Concepts

  • What is a Data Warehouse?
  • Difference between OLTP and Data Warehousing
  • Data Acquisition
  • Data Extraction
  • Data Transformation
  • Data Loading
  • Data Marts
  • Dependent vs Independent Data Marts
  • Database Design

ETL Testing Concepts

  • Introduction
  • Software development life cycle
  • Testing methodologies
  • ETL Testing Work Flow Process
  • ETL Testing Responsibilities in DataStage

Big Data Fundamentals

  • Big Data and its role in the corporate world
  • The phases of development of a Big Data strategy within a corporation
  • The rationale underlying a holistic approach to Big Data
  • Components needed in a Big Data Platform
  • Big Data storage solutions
  • Limits of Traditional Technologies
  • Overview of database types

NoSQL Databases

Hadoop

MapReduce

Apache Spark

Machine Learning Fundamentals with Scala and Apache Spark Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

Knowledge of the Java or Scala programming language. Basic familiarity with statistics and linear algebra is recommended.

Overview

The aim of this course is to provide basic proficiency in applying Machine Learning methods in practice. Using the Scala programming language and its various libraries, and drawing on a multitude of practical examples, this course teaches how to use the most important building blocks of Machine Learning, how to make data modeling decisions, how to interpret the outputs of algorithms, and how to validate the results.

Our goal is to give you the skills to confidently understand and use the most fundamental tools from the Machine Learning toolbox and to avoid the common pitfalls of Data Science applications.

Course Outline

Introduction to Applied Machine Learning

  • Statistical learning vs. Machine learning
  • Iteration and evaluation
  • Bias-Variance trade-off

Machine Learning with Scala

  • Choice of libraries
  • Add-on tools

Regression

  • Linear regression
  • Generalizations and Nonlinearity
  • Exercises

Classification

  • Bayesian refresher
  • Naive Bayes
  • Logistic regression
  • K-Nearest neighbors
  • Exercises

Cross-validation and Resampling

  • Cross-validation approaches
  • Bootstrap
  • Exercises

Unsupervised Learning

  • K-means clustering
  • Examples
  • Challenges of unsupervised learning and beyond K-means