SMACK Stack for Data Science Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • An understanding of data processing systems

Audience

  • Data Scientists

Overview

SMACK is a collection of data platform software, namely Apache Spark, Apache Mesos, Akka, Apache Cassandra, and Apache Kafka. Using the SMACK stack, users can create and scale data processing platforms.

This instructor-led, live training (online or onsite) is aimed at data scientists who wish to use the SMACK stack to build data processing platforms for big data solutions.

By the end of this training, participants will be able to:

  • Implement a data pipeline architecture for processing big data.
  • Develop a cluster infrastructure with Apache Mesos and Docker.
  • Analyze data with Spark and Scala.
  • Manage unstructured data with Apache Cassandra.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

SMACK Stack Overview

  • What is Apache Spark? Apache Spark features
  • What is Apache Mesos? Apache Mesos features
  • What is Akka? Akka features
  • What is Apache Cassandra? Apache Cassandra features
  • What is Apache Kafka? Apache Kafka features

Scala Language

  • Scala syntax and structure
  • Scala control flow

Preparing the Development Environment

  • Installing and configuring the SMACK stack
  • Installing and configuring Docker

Akka

  • Using actors

Apache Cassandra

  • Creating a database for read operations
  • Working with backups and recovery

Connectors

  • Creating a stream
  • Building an Akka application
  • Storing data with Cassandra
  • Reviewing connectors

Apache Kafka

  • Working with clusters
  • Creating, publishing, and consuming messages

Apache Mesos

  • Allocating resources
  • Running clusters
  • Working with Apache Aurora and Docker
  • Running services and jobs
  • Deploying Spark, Cassandra, and Kafka on Mesos

Apache Spark

  • Managing data flows
  • Working with RDDs and dataframes
  • Performing data analysis

Troubleshooting

  • Handling service failures and errors

Summary and Conclusion

Big Data – Data Science Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

Delegates should have an awareness and some experience of storage tools and an awareness of handling large data sets.

Overview

This classroom-based training session will explore Big Data. Delegates will have computer-based examples and case study exercises to undertake with relevant big data tools.

Course Outline

  1. Big data fundamentals
    • Big Data and its role in the corporate world
    • The phases of development of a Big Data strategy within a corporation
    • Explain the rationale underlying a holistic approach to Big Data
    • Components needed in a Big Data Platform
    • Big data storage solution
    • Limits of Traditional Technologies
    • Overview of database types
    • The four dimensions of Big Data
  2. Big data impact on business
    • Business importance of Big Data
    • Challenges of extracting useful data
    • Integrating Big data with traditional data
  3. Big data storage technologies
    • Overview of big data technologies
      • Data storage models
      • Hadoop
      • Hive
      • Cassandra
      • MongoDB
    • Choosing the right big data technology
  4. Processing big data
    • Connecting to and extracting data from databases
    • Transforming and preparing data for processing
    • Using Hadoop MapReduce for processing distributed data
    • Monitoring and executing Hadoop MapReduce jobs
    • Hadoop distributed file system building blocks
    • MapReduce and YARN
    • Handling streaming data with Spark
  5. Big data analysis tools and technologies
    • Programming Hadoop with the Pig Latin language
    • Querying big data with Hive
    • Mining data with Mahout
    • Visualizing and reporting tools
  6. Big data in business
    • Managing and establishing Big Data needs
    • Business importance of Big Data
    • Selecting the right big data tools for the problem

Data Warehousing Concepts

  • What is a Data Warehouse?
  • Difference between OLTP and Data Warehousing
  • Data Acquisition
  • Data Extraction
  • Data Transformation
  • Data Loading
  • Data Marts
  • Dependent vs Independent Data Marts
  • Database design

ETL Testing Concepts

  • Introduction
  • Software development life cycle
  • Testing methodologies
  • ETL Testing Work Flow Process
  • ETL Testing Responsibilities in DataStage

Big data Fundamentals

  • Big Data and its role in the corporate world
  • The phases of development of a Big Data strategy within a corporation
  • Explain the rationale underlying a holistic approach to Big Data
  • Components needed in a Big Data Platform
  • Big data storage solution
  • Limits of Traditional Technologies
  • Overview of database types

NoSQL Databases

Hadoop

Map Reduce

Apache Spark

Data Science for Big Data Analytics Training Course

Duration

35 hours (usually 5 days including breaks)

Overview

Big data refers to data sets that are so voluminous and complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.

Course Outline

Introduction to Data Science for Big Data Analytics

  • Data Science Overview
  • Big Data Overview
  • Data Structures
  • Drivers and complexities of Big Data
  • Big Data ecosystem and a new approach to analytics
  • Key technologies in Big Data
  • Data Mining process and problems
    • Association Pattern Mining
    • Data Clustering
    • Outlier Detection
    • Data Classification

Introduction to Data Analytics lifecycle

  • Discovery
  • Data preparation
  • Model planning
  • Model building
  • Presentation/Communication of results
  • Operationalization
  • Exercise: Case study

From this point most of the training time (80%) will be spent on examples and exercises in R and related big data technology.

Getting started with R

  • Installing R and RStudio
  • Features of R language
  • Objects in R
  • Data in R
  • Data manipulation
  • Big data issues
  • Exercises

Getting started with Hadoop

  • Installing Hadoop
  • Understanding Hadoop modes
  • HDFS
  • MapReduce architecture
  • Hadoop related projects overview
  • Writing programs in Hadoop MapReduce
  • Exercises

Integrating R and Hadoop with RHadoop

  • Components of RHadoop
  • Installing RHadoop and connecting with Hadoop
  • The architecture of RHadoop
  • Hadoop streaming with R
  • Data analytics problem solving with RHadoop
  • Exercises

Pre-processing and preparing data

  • Data preparation steps
  • Feature extraction
  • Data cleaning
  • Data integration and transformation
  • Data reduction – sampling, feature subset selection, dimensionality reduction
  • Discretization and binning
  • Exercises and Case study

Exploratory data analytic methods in R

  • Descriptive statistics
  • Exploratory data analysis
  • Visualization – preliminary steps
  • Visualizing single variable
  • Examining multiple variables
  • Statistical methods for evaluation
  • Hypothesis testing
  • Exercises and Case study

Data Visualizations

  • Basic visualizations in R
  • Packages for data visualization: ggplot2, lattice, plotly
  • Formatting plots in R
  • Advanced graphs
  • Exercises

Regression (Estimating future values)

  • Linear regression
  • Use cases
  • Model description
  • Diagnostics
  • Problems with linear regression
  • Shrinkage methods, ridge regression, the lasso
  • Generalizations and nonlinearity
  • Regression splines
  • Local polynomial regression
  • Generalized additive models
  • Regression with RHadoop
  • Exercises and Case study

Classification

  • Classification-related problems
  • Bayesian refresher
  • Naïve Bayes
  • Logistic regression
  • K-nearest neighbors
  • Decision tree algorithms
  • Neural networks
  • Support vector machines
  • Diagnostics of classifiers
  • Comparison of classification methods
  • Scalable classification algorithms
  • Exercises and Case study

Assessing model performance and selection

  • Bias, Variance and model complexity
  • Accuracy vs Interpretability
  • Evaluating classifiers
  • Measures of model/algorithm performance
  • Hold-out method of validation
  • Cross-validation
  • Tuning machine learning algorithms with caret package
  • Visualizing model performance with Profit, ROC and Lift curves
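
The hold-out and cross-validation topics above rest on splitting the data into training and test parts. The following is a minimal sketch of a k-fold split, written in plain Python for illustration only (the course itself works in R with the caret package). The function name `k_fold_indices` is hypothetical and is not part of any library.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        # Every sample outside the current fold is used for training.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Hypothetical example: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
```

Each sample lands in exactly one test fold, so averaging a metric over the folds uses every observation for testing once. In practice the data would usually be shuffled before splitting.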

Ensemble Methods

  • Bagging
  • Random Forests
  • Boosting
  • Gradient boosting
  • Exercises and Case study

Support vector machines for classification and regression

  • Maximal Margin classifiers
    • Support vector classifiers
    • Support vector machines
    • SVMs for classification problems
    • SVMs for regression problems
  • Exercises and Case study

Identifying unknown groupings within a data set

  • Feature Selection for Clustering
  • Representative-based algorithms: k-means, k-medoids
  • Hierarchical algorithms: agglomerative and divisive methods
  • Probabilistic model-based algorithms: EM
  • Density-based algorithms: DBSCAN, DENCLUE
  • Cluster validation
  • Advanced clustering concepts
  • Clustering with RHadoop
  • Exercises and Case study

Discovering connections with Link Analysis

  • Link analysis concepts
  • Metrics for analyzing networks
  • The PageRank algorithm
  • Hyperlink-Induced Topic Search
  • Link Prediction
  • Exercises and Case study
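
As a rough illustration of the PageRank topic above, the sketch below runs a fixed number of power iterations over a small link graph. It is a simplified teaching version in plain Python, not the course's own material: the function name `pagerank`, the damping value 0.85, the iteration count, and the sample graph are all assumptions, and every link target is assumed to appear as a key in the input dictionary.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Approximate PageRank scores by repeated rank redistribution."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared equally by all pages.
        dangling = sum(rank[node] for node in nodes if not links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        # Each page passes its rank on in equal shares to the pages it links to.
        for node in nodes:
            targets = links[node]
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page graph: page "c" receives the most incoming links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

The scores always sum to 1, so they can be read as the share of time a random surfer would spend on each page.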

Association Pattern Mining

  • Frequent Pattern Mining Model
  • Scalability issues in frequent pattern mining
  • Brute Force algorithms
  • Apriori algorithm
  • The FP growth approach
  • Evaluation of Candidate Rules
  • Applications of Association Rules
  • Validation and Testing
  • Diagnostics
  • Association rules with R and Hadoop
  • Exercises and Case study

Constructing recommendation engines

  • Understanding recommender systems
  • Data mining techniques used in recommender systems
  • Recommender systems with recommenderlab package
  • Evaluating the recommender systems
  • Recommendations with RHadoop
  • Exercise: Building recommendation engine

Text analysis

  • Text analysis steps
  • Collecting raw text
  • Bag of words
  • Term Frequency – Inverse Document Frequency (TF-IDF)
  • Determining Sentiments
  • Exercises and Case study
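
The bag-of-words and TF-IDF steps listed above can be sketched in a few lines. This is one common formulation (term frequency as a share of the document's tokens, inverse document frequency as the natural log of N divided by the document count); other variants add smoothing. The function name `tf_idf` and the sample corpus are hypothetical, and plain Python is used for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dictionary per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # bag of words for this document
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical corpus, already split into lowercase tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
scores = tf_idf(corpus)
```

A term that occurs in every document gets a weight of zero, which is how TF-IDF pushes down words that do not help distinguish one document from another.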

SQL For Data Science and Data Analysis Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • An understanding of databases
  • Experience with SQL is an asset

Audience

  • Business analysts
  • Software developers
  • Database developers

Overview

This instructor-led, live training (online or onsite) is aimed at software developers, managers, and business analysts who wish to use big data systems to store and retrieve large amounts of data.

By the end of this training, participants will be able to:

  • Query large amounts of data efficiently
  • Understand how big data systems store and retrieve data
  • Use the latest big data systems available
  • Wrangle data from data systems into reporting systems
  • Write SQL queries in:
    • MySQL
    • Postgres
    • Hive Query Language (HiveQL/HQL)
    • Redshift

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Lesson 1 – SQL basics

  • Select statements
  • Join types
  • Indexes
  • Views
  • Subqueries
  • Union
  • Creating tables
  • Loading data
  • Dumping data
  • NoSQL

Lesson 2 – Data Modeling

  • Transaction-based ER systems
  • Data warehousing
  • Data warehouse models
    • Star schema
    • Snowflake schemas
  • Slowly changing dimensions (SCD)
  • Structured and non-structured data
  • Types of table storage engines:
    • Column-based
    • Document-based
    • In-memory

Lesson 3 – Indexes in the NoSQL/Data science world

  • Constraints (Primary)
  • Index-based scanning
  • Performance tuning

Lesson 4 – NoSQL and non-structured data

  • When to use NoSQL
  • Eventually consistent data
  • Schema on read vs. Schema on write

Lesson 5 – SQL for data analytics

  • Window functions
  • Lateral Joins
  • Lead & Lag
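
To make the Lead & Lag topic above concrete, the sketch below mimics what the SQL `LAG` and `LEAD` window functions return over an ordered column, using plain Python lists. It is an illustration only: the helper names `lag` and `lead` and the sample figures are hypothetical, and rows with no neighbour get `None`, in the same way SQL returns NULL by default.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def lag(values: Sequence[T], offset: int = 1) -> List[Optional[T]]:
    """For each row, the value `offset` rows earlier, or None if there is none."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]


def lead(values: Sequence[T], offset: int = 1) -> List[Optional[T]]:
    """For each row, the value `offset` rows later, or None if there is none."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]


# Hypothetical monthly sales, already ordered by month.
monthly_sales = [100, 120, 90, 150]
previous = lag(monthly_sales)    # [None, 100, 120, 90]
following = lead(monthly_sales)  # [120, 90, 150, None]

# Month-over-month change, a typical use of LAG in analytics queries.
change = [cur - prev if prev is not None else None
          for cur, prev in zip(monthly_sales, previous)]
```

In SQL the same result comes from a single pass with an `ORDER BY` inside the window definition, which avoids a self-join on the previous row.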

Lesson 6 – HiveQL

  • SQL Support
  • External and Internal Tables
  • Joins
  • Partitions
  • Correlated subqueries
  • Nested queries
  • When to use Hive

Lesson 7 – Redshift

  • Design and structure
  • Locks and shared resources
  • Postgres differences
  • When to use Redshift

Visual Analytics – Data science Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

Experience of analysis, statistics and producing data is an advantage.

Overview

This classroom-based training session will contain presentations, computer-based examples and case study exercises to undertake.

Course Outline

  1. Introduction to Visual Analytics
    • 5 Principles of Data Visualisation
    • Tables vs charts
    • What makes visualisations effective
    • Gestalt Principles of Visual Perception
  2. Types of charts and how to choose the right one
    • Common types of charts
    • Choosing the right chart for your data
    • Understanding your audience
    • Handling missing data
  3. Advanced charts
    • Sankey
    • Radar
    • Treemap
    • Heatmap
    • Boxplot, violin plot
    • Choosing the right chart for your data
    • Choosing the right chart for your audience
    • Eliminating clutter from charts
  4. Storytelling with data
    • The importance of storytelling
    • Building a narrative structure
    • Drawing attention
    • Including call to action
  5. Creating dashboards and infographics
    • Exploratory vs explanatory analysis
    • How to convey your message
    • Live presentation vs report
    • Visualisations that are simple, informative and engaging
    • The characteristics of a good dashboard
    • The characteristics of a good infographic
  6. Common mistakes and misleading charts
    • Charts that should be avoided
    • How we are being deceived by colour, scale and size
  7. Visual analytics case studies

Neural computing – Data science Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

Knowledge/appreciation of machine learning, systems architecture and programming languages is desirable.

Overview

This classroom-based training session will contain presentations, computer-based examples and case study exercises to undertake with relevant neural and deep network libraries.

Course Outline

  1. Overview of neural networks and deep learning
    • The concept of Machine Learning (ML)
    • Why do we need neural networks and deep learning?
    • Selecting networks for different problems and data types
    • Learning and validating neural networks
    • Comparing logistic regression to neural network
  2. Neural network
    • Biological inspirations for neural networks
    • Neural networks – neuron, perceptron and MLP (multilayer perceptron model)
    • Learning MLP – backpropagation algorithm
    • Activation functions – linear, sigmoid, tanh, softmax
    • Loss functions appropriate to forecasting and classification
    • Parameters – learning rate, regularization, momentum
    • Building Neural Networks in Python
    • Evaluating performance of neural networks in Python
  3. Basics of Deep Networks
    • What is deep learning?
    • Architecture of deep networks – parameters, layers, activation functions, loss functions, solvers
    • Restricted Boltzmann Machines (RBMs)
    • Autoencoders
  4. Deep Networks Architectures
    • Deep Belief Networks (DBN) – architecture, application
    • Autoencoders
    • Restricted Boltzmann Machines
    • Convolutional Neural Network
    • Recursive Neural Network
    • Recurrent Neural Network
  5. Overview of libraries and interfaces available in Python
    • Caffe
    • Theano
    • TensorFlow
    • Keras
    • MXNet
    • Choosing the appropriate library for the problem
  6. Building deep networks in Python
    • Choosing the appropriate architecture for a given problem
    • Hybrid deep networks
    • Learning network – appropriate library, architecture definition
    • Tuning network – initialization, activation functions, loss functions, optimization method
    • Avoiding overfitting – detecting overfitting problems in deep networks, regularization
    • Evaluating deep networks
  7. Case studies in Python
    • Image recognition – CNN
    • Detecting anomalies with Autoencoders
    • Forecasting time series with RNN
    • Dimensionality reduction with Autoencoder
    • Classification with RBM
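
The activation functions named in item 2 of the outline (linear, sigmoid, tanh, softmax) can be written out directly. The sketch below uses only the Python standard library and is meant as a reference for the formulas, not as the code used in the course; the function names are chosen here for illustration.

```python
import math
from typing import List


def linear(x: float) -> float:
    """Identity activation: the output equals the input."""
    return x


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x: float) -> float:
    """Hyperbolic tangent, with outputs in the range -1 to 1."""
    return math.tanh(x)


def softmax(xs: List[float]) -> List[float]:
    """Turn a list of scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output-layer scores for a three-class problem.
probabilities = softmax([1.0, 2.0, 3.0])
```

Sigmoid and tanh act on one value at a time and are typical for hidden layers, while softmax acts on a whole vector and is typically used on the output layer of a classifier.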

Snorkel: Rapidly Process Training Data Training Course

Duration

7 hours (usually 1 day including breaks)

Requirements

  • An understanding of machine learning

Overview

Snorkel is a system for rapidly creating, modeling, and managing training data. It focuses on accelerating the development of structured or “dark” data extraction applications for domains in which large labeled training sets are not available or easy to obtain.

In this instructor-led, live training, participants will learn techniques for extracting value from unstructured data such as text, tables, figures, and images through modeling of training data with Snorkel.

By the end of this training, participants will be able to:

  • Programmatically create training sets to enable the labeling of massive training sets
  • Train high-quality end models by first modeling noisy training sets
  • Use Snorkel to implement weak supervision techniques and apply data programming to weakly-supervised machine learning systems

Audience

  • Developers
  • Data scientists

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

To request a customized course outline for this training, please contact us.

Embedding Projector: Visualizing Your Training Data Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • Working experience with data visualization tools
  • Knowledge of machine learning is helpful, but not required
  • Knowledge of TensorFlow is helpful, but not required

Overview

Embedding Projector is an open-source web application for visualizing the data used to train machine learning systems. Created by Google, it is part of TensorFlow.

This instructor-led, live training introduces the concepts behind Embedding Projector and walks participants through the setup of a demo project.

By the end of this training, participants will be able to:

  • Explore how data is being interpreted by machine learning models
  • Navigate through 3D and 2D views of data to understand how a machine learning algorithm interprets it
  • Understand the concepts behind Embeddings and their role in representing mathematical vectors for images, words and numerals
  • Explore the properties of a specific embedding to understand the behavior of a model
  • Apply Embedding Projector to real-world use cases such as building a song recommendation system for music lovers

Audience

  • Developers
  • Data scientists

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

To request a customized course outline for this training, please contact us.

Introduction to Data Science and AI using Python Training Course

Duration

35 hours (usually 5 days including breaks)

Requirements

None

Overview

This is a 5-day introduction to Data Science and Artificial Intelligence (AI).

The course is delivered with examples and exercises using Python.

Course Outline

Introduction to Data Science/AI

  • Knowledge acquisition through data
  • Knowledge representation
  • Value creation
  • Data Science overview
  • AI ecosystem and new approach to analytics
  • Key technologies

Data Science workflow

  • CRISP-DM
  • Data preparation
  • Model planning
  • Model building
  • Communication
  • Deployment

Data Science technologies

  • Languages used for prototyping
  • Big Data technologies
  • End to end solutions to common problems
  • Introduction to Python language
  • Integrating Python with Spark

AI in Business

  • AI ecosystem
  • Ethics of AI
  • How to drive AI in business

Data sources

  • Types of data
  • SQL vs NoSQL
  • Data Storage
  • Data preparation

Data Analysis – Statistical approach

  • Probability
  • Statistics
  • Statistical modeling
  • Applications in business using Python

Machine learning in business

  • Supervised vs unsupervised
  • Forecasting problems
  • Classification problems
  • Clustering problems
  • Anomaly detection
  • Recommendation engines
  • Association pattern mining
  • Solving ML problems with Python language

Deep learning

  • Problems where traditional ML algorithms fail
  • Solving complicated problems with Deep Learning
  • Introduction to TensorFlow

Natural Language processing

Data visualization

  • Visual reporting outcomes from modeling
  • Common pitfalls in visualization
  • Data visualization with Python

From Data to Decision – communication

  • Making impact: data-driven storytelling
  • Influence effectiveness
  • Managing Data Science projects

Mastering Python Programming (April 2023)

You will learn the basics of Python programming and its features

You will explore the Cloud Client Libraries in GCP

You will get to know how Python is used in data science

You will learn to work with ML applications using Python

Requirements

  • An understanding of basic Python programming
  • Working knowledge of GCP cloud services

Description

If you are looking to build skills in Python programming, along with machine learning, data science, and the use of Python on cloud platforms, then this is the course for you!

This course takes a hands-on approach to Python programming using IDLE (Python 3.11 64-bit).

Python is an interpreted, high-level, general-purpose programming language. It is easy to learn yet powerful, and its syntax allows developers to write programs in fewer lines than many other programming languages.

In this course you will learn about Python and its features, its data types and data structures, looping and conditional statements, and functions and modules.

You will also learn the OOP concepts of Python, decorators, generators, exception handling, and file handling.
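
As a small taste of three of the features named above, the sketch below combines a decorator, exception handling, and a generator. The names `count_calls`, `safe_divide`, and `squares` are made up for this illustration and are not taken from the course material.

```python
import functools
from typing import Callable, Iterator, Optional


def count_calls(func: Callable) -> Callable:
    """Decorator that records how many times the wrapped function is called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def safe_divide(a: float, b: float) -> Optional[float]:
    """Divide a by b, returning None instead of raising on division by zero."""
    try:
        return a / b
    except ZeroDivisionError:
        return None


def squares(limit: int) -> Iterator[int]:
    """Generator that yields square numbers one at a time, on demand."""
    for n in range(limit):
        yield n * n


first_result = safe_divide(6, 3)   # normal division
second_result = safe_divide(1, 0)  # handled error, no exception escapes
first_squares = list(squares(4))
```

The decorator adds behaviour without changing the body of `safe_divide`, and the generator produces values lazily rather than building the whole list up front.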

In this course you will learn to use the Python libraries in GCP, and how to use Python in machine learning and data science.

Our focus is to teach topics that flow smoothly. The course teaches you everything you need to know about Python programming with hands-on examples.

This course gives a quick introduction to Python programming with an emphasis on its activity lessons.

What are you waiting for?

Every day you wait is a missed opportunity.

Hurry up!

Who this course is for:

  • Developers interested in Mastering Python Programming
  • Python Developers
  • Data Scientists
  • Data Analysts
  • Software Developers and Cloud Developers

Course content