Duration
35 hours (usually 5 days including breaks)
Requirements
Knowledge of one of the following:
- Java
- Scala
- Python
- SparkR
Overview
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It provides common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It is divided into two packages:
- spark.mllib contains the original API built on top of RDDs.
- spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
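As a quick illustration of the spark.ml style described above, the sketch below chains two transformers and an estimator into a Pipeline over a DataFrame. It is a minimal Scala sketch; the sample data, column names, and parameter values are illustrative assumptions, not part of the course material.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PipelineSketch").getOrCreate()

    // Hypothetical training data: (id, text, label)
    val training = spark.createDataFrame(Seq(
      (0L, "spark makes big data simple", 1.0),
      (1L, "an unrelated example sentence", 0.0)
    )).toDF("id", "text", "label")

    // spark.ml style: transformers and an estimator chained into a Pipeline
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Fitting produces a PipelineModel that can transform new DataFrames
    val model = pipeline.fit(training)
    model.transform(training).select("id", "probability", "prediction").show()

    spark.stop()
  }
}
```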
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark.
Course Outline
spark.mllib: data types, algorithms, and utilities
- Data types
- Basic statistics
- summary statistics
- correlations
- stratified sampling
- hypothesis testing
- streaming significance testing
- random data generation
- Classification and regression
- linear models (SVMs, logistic regression, linear regression)
- naive Bayes
- decision trees
- ensembles of trees (Random Forests and Gradient-Boosted Trees)
- isotonic regression
- Collaborative filtering
- alternating least squares (ALS)
- Clustering
- k-means
- Gaussian mixture
- power iteration clustering (PIC)
- latent Dirichlet allocation (LDA)
- bisecting k-means
- streaming k-means
- Dimensionality reduction
- singular value decomposition (SVD)
- principal component analysis (PCA)
- Feature extraction and transformation
- Frequent pattern mining
- FP-growth
- association rules
- PrefixSpan
- Evaluation metrics
- PMML model export
- Optimization (developer)
- stochastic gradient descent
- limited-memory BFGS (L-BFGS)
spark.ml: high-level APIs for ML pipelines
- Overview: estimators, transformers and pipelines
- Extracting, transforming and selecting features
- Classification and regression
- Clustering
- Advanced topics
Duration
21 hours (usually 3 days including breaks)
Requirements
- .NET programming experience using C# or F#
Overview
Apache Spark is a distributed processing engine for analyzing very large data sets. It can process data in batches and real-time, as well as carry out machine learning, ad-hoc queries, and graph processing. .NET for Apache Spark is a free, open-source, and cross-platform big data analytics framework that supports applications written in C# or F#.
This instructor-led, live training (online or onsite) is aimed at developers who wish to carry out big data analysis using Apache Spark in their .NET applications.
By the end of this training, participants will be able to:
- Install and configure Apache Spark.
- Understand how .NET implements Spark APIs so that they can be accessed from a .NET application.
- Develop data processing applications using C# or F#, capable of handling data sets whose size is measured in terabytes and petabytes.
- Develop machine learning features for a .NET application using Apache Spark capabilities.
- Carry out exploratory analysis using SQL queries on big data sets.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
Overview of Apache Spark Features and Architecture
- Apache Spark modules: Spark SQL, Spark Streaming, MLlib, GraphX
- RDD, DataFrames, driver and workers, DAG, etc.
Setting up Apache Spark on .NET
- Preparing the Java VM
- Running .NET for Apache Spark using .NET Core
Getting Started
- Creating a sample .NET console application
- Adding the Spark driver
- Initializing a SparkSession
- Executing the application
Preparing Data
- Building a data preparation pipeline
- Performing ETL (Extract, Transform, and Load)
Machine Learning
- Building a machine learning model
- Preparing the data
- Training a model
Real-time Processing
- Processing streaming data in real time
- Case study: monitoring sensor data
Interactive Query
- Working with Spark SQL
- Analyzing structured data
Visualizing Results
- Plotting results
- Using third-party tools to visualize results
Troubleshooting
Summary and Conclusion
Duration
21 hours (usually 3 days including breaks)
Requirements
- Experience with the Linux command line
- A general understanding of data processing
- Programming experience with Java, Scala, Python, or R
Overview
Apache Spark is an analytics engine designed to distribute data across a cluster in order to process it in parallel. It contains modules for streaming, SQL, machine learning and graph processing.
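To make the idea of cluster-parallel processing concrete, here is a minimal Scala sketch; the file path and column names are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ParallelReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ParallelReadSketch").getOrCreate()

    // Hypothetical large CSV data set; Spark splits it into partitions
    // that executor tasks process in parallel across the cluster
    val events = spark.read
      .option("header", "true")
      .csv("/data/events/*.csv")

    println(s"Partitions: ${events.rdd.getNumPartitions}")

    // Each partition is aggregated in parallel, then the partial results are combined
    events.groupBy("event_type").count().show()

    spark.stop()
  }
}
```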
This instructor-led, live training (online or onsite) is aimed at engineers who wish to deploy an Apache Spark system for processing very large amounts of data.
By the end of this training, participants will be able to:
- Install and configure Apache Spark.
- Understand the difference between Apache Spark and Hadoop MapReduce and when to use which.
- Quickly read in and analyze very large data sets.
- Integrate Apache Spark with other machine learning tools.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
- Apache Spark vs Hadoop MapReduce
Overview of Apache Spark Features and Architecture
Choosing a Programming Language
Setting up Apache Spark
Creating a Sample Application
Choosing the Data Set
Running Data Analysis on the Data
Processing Structured Data with Spark SQL
Processing Streaming Data with Spark Streaming
Integrating Apache Spark with 3rd Party Machine Learning Tools
Using Apache Spark for Graph Processing
Optimizing Apache Spark
Troubleshooting
Summary and Conclusion
Duration
21 hours (usually 3 days including breaks)
Requirements
- Programming skills (preferably Python or Scala)
- SQL basics
Overview
Apache Spark’s learning curve increases slowly at the beginning; it takes a lot of effort to get the first results. This course aims to jump through that first tough part. After taking this course, participants will understand the basics of Apache Spark, clearly differentiate RDDs from DataFrames, learn the Python and Scala APIs, understand executors and tasks, etc. Following best practices, this course also strongly focuses on cloud deployment with Databricks and AWS. The students will also understand the differences between AWS EMR and AWS Glue, one of the latest Spark services from AWS.
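For a first impression of the RDD vs DataFrame distinction mentioned above, here is a minimal Scala sketch performing the same aggregation with both APIs; the input records and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object RddVsDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RddVsDataFrameSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one log line per record
    val lines = Seq("INFO start", "ERROR disk full", "INFO done", "ERROR timeout")

    // RDD API: low-level, functional transformations on raw objects
    val rddCounts = spark.sparkContext
      .parallelize(lines)
      .map(line => (line.split(" ")(0), 1))
      .reduceByKey(_ + _)
    rddCounts.collect().foreach(println)

    // DataFrame API: declarative, optimized by the Catalyst engine
    lines.toDF("line")
      .withColumn("level", split($"line", " ")(0))
      .groupBy("level")
      .count()
      .show()

    spark.stop()
  }
}
```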
Audience
Data Engineers, DevOps Engineers, Data Scientists
Course Outline
Introduction:
- Apache Spark in Hadoop Ecosystem
- Short introduction to Python and Scala
Basics (theory):
- Architecture
- RDD
- Transformation and Actions
- Stage, Task, Dependencies
Using the Databricks environment to understand the basics (hands-on workshop):
- Exercises using RDD API
- Basic action and transformation functions
- PairRDD
- Join
- Caching strategies
- Exercises using DataFrame API
- SparkSQL
- DataFrame: select, filter, group, sort
- UDF (User Defined Function)
- Looking into the Dataset API
- Streaming
Using the AWS environment to understand deployment (hands-on workshop):
- Basics of AWS Glue
- Understand the differences between AWS EMR and AWS Glue
- Example jobs in both environments
- Understand pros and cons
Extra:
- Introduction to Apache Airflow orchestration
Duration
7 hours (usually 1 day including breaks)
Requirements
- Experience with SQL queries
- Programming experience in any language
Audience
- Data analysts
- Data scientists
- Data engineers
Overview
Spark SQL is Apache Spark’s module for working with structured and semi-structured data. Spark SQL provides information about the structure of the data as well as the computation being performed. This information can be used to perform optimizations. Two common uses for Spark SQL are:
- to execute SQL queries.
- to read data from an existing Hive installation.
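The hedged Scala sketch below shows both uses in miniature: it registers a DataFrame as a view and queries it with plain SQL. The JSON path, column names, and Hive support are assumptions for illustration (enableHiveSupport requires Hive dependencies and an existing metastore).

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark read tables from an existing Hive installation
    val spark = SparkSession.builder
      .appName("SparkSqlSketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical JSON data set; path and schema are illustrative only
    val orders = spark.read.json("/data/orders.json")
    orders.createOrReplaceTempView("orders")

    // Plain SQL over the registered view
    spark.sql(
      """SELECT customer_id, SUM(amount) AS total
        |FROM orders
        |GROUP BY customer_id
        |ORDER BY total DESC""".stripMargin
    ).show()

    spark.stop()
  }
}
```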
In this instructor-led, live training (onsite or remote), participants will learn how to analyze various types of data sets using Spark SQL.
By the end of this training, participants will be able to:
- Install and configure Spark SQL.
- Perform data analysis using Spark SQL.
- Query data sets in different formats.
- Visualize data and query results.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
Overview of Data Access Approaches (Hive, databases, etc.)
Overview of Spark Features and Architecture
Installing and Configuring Spark
Understanding DataFrames in Spark
Defining Tables and Importing Datasets
Querying DataFrames using SQL
Carrying out Aggregations, JOINs and Nested Queries
Uploading and Accessing Data
Querying Different Types of Data
Querying Data Lakes with SQL
Troubleshooting
Summary and Conclusion
Duration
14 hours (usually 2 days including breaks)
Overview
The course is part of the Data Scientist skill set (Domain: Data and Technology).
Course Outline
Data Warehousing Concepts
- What is a Data Warehouse?
- Difference between OLTP and Data Warehousing
- Data Acquisition
- Data Extraction
- Data Transformation
- Data Loading
- Data Marts
- Dependent vs. Independent Data Marts
- Database Design
ETL Testing Concepts
- Introduction
- Software development life cycle
- Testing methodologies
- ETL testing workflow process
- ETL testing responsibilities in DataStage
Big Data Fundamentals
- Big Data and its role in the corporate world
- The phases of development of a Big Data strategy within a corporation
- The rationale underlying a holistic approach to Big Data
- Components needed in a Big Data platform
- Big Data storage solutions
- Limits of traditional technologies
- Overview of database types
NoSQL Databases
Hadoop
MapReduce
Apache Spark
Duration
14 hours (usually 2 days including breaks)
Requirements
Knowledge of the Java or Scala programming language. Basic familiarity with statistics and linear algebra is recommended.
Overview
The aim of this course is to provide basic proficiency in applying Machine Learning methods in practice. Through the use of the Scala programming language and its various libraries, and based on a multitude of practical examples, this course teaches how to use the most important building blocks of Machine Learning, how to make data modeling decisions, how to interpret the outputs of the algorithms, and how to validate the results.
Our goal is to give you the skills to understand and use the most fundamental tools from the Machine Learning toolbox confidently and to avoid the common pitfalls of Data Science applications.
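As one possible illustration with Scala libraries (here Spark's spark.ml package, one of several options the course may use), the hedged sketch below fits a linear regression and validates it on held-out data; the data set, column names, and parameter values are invented for illustration.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object LinearRegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LinearRegressionSketch").getOrCreate()

    // Tiny hypothetical data set: two numeric features and a numeric label
    val data = spark.createDataFrame(Seq(
      (1.0, 2.0, 5.1), (2.0, 1.0, 4.0), (3.0, 4.0, 11.2),
      (4.0, 3.0, 9.9), (5.0, 5.0, 15.3), (6.0, 2.0, 9.8)
    )).toDF("x1", "x2", "label")

    // Assemble raw columns into the feature vector expected by the estimator
    val assembler = new VectorAssembler().setInputCols(Array("x1", "x2")).setOutputCol("features")
    val Array(train, test) = assembler.transform(data).randomSplit(Array(0.75, 0.25), seed = 42)

    // Fit the model on the training split and validate it on the held-out split
    val model = new LinearRegression().fit(train)
    val rmse = new RegressionEvaluator().setMetricName("rmse").evaluate(model.transform(test))
    println(s"Test RMSE: $rmse")

    spark.stop()
  }
}
```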
Course Outline
Introduction to Applied Machine Learning
- Statistical learning vs. Machine learning
- Iteration and evaluation
- Bias-Variance trade-off
Machine Learning with Scala
- Choice of libraries
- Add-on tools
Regression
- Linear regression
- Generalizations and Nonlinearity
- Exercises
Classification
- Bayesian refresher
- Naive Bayes
- Logistic regression
- K-Nearest neighbors
- Exercises
Cross-validation and Resampling
- Cross-validation approaches
- Bootstrap
- Exercises
Unsupervised Learning
- K-means clustering
- Examples
- Challenges of unsupervised learning and beyond K-means