Data Science with Tableau and R Programming Training Course

Introduction

Core Programming and Syntax in R

  • Variables
  • Loops
  • Conditional statements

Fundamentals of R

  • What are vectors?
  • Functions and packages in R

Preparing the Development Environment

  • Installing and configuring R and RStudio
  • Setting up Rserve

Classifying Data

  • Moving data between R and Tableau
  • Preparing and cleaning data
  • Modeling and scripting in R

Regressions in R and Tableau

  • Creating a regression model
  • Visualizing regressions
  • Predicting and comparing values

Clustering and Models

  • Working with clustering algorithms
  • Creating clusters
  • Visualizing clustered data

Advanced Analytics with R and Tableau

  • Using CRISP-DM
  • Working with TDSP models
  • Summarizing data

Data Analysis with Tableau and Python Training Course

  • Introduction
  • Overview of Tableau and the TabPy API
  • Exploring Use Cases of TabPy for Data Scientists
  • Installing and Setting Up TabPy
  • Setting Up Tableau Desktop with Python
  • Configuring a TabPy Connection on Tableau
  • Passing Expressions to Python
  • Running Python Scripts on Tableau
  • Estimating the Probability of Customer Churn Using Logistic Regression
  • Getting Sentiment Scores for Reviews of Products Sold
  • Scoring User Behavior using a Predictive Model
  • Using Findings to Create an Efficient Conversion Funnel
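
Since TabPy's scripting language is Python, a short illustration makes the churn topic concrete. Below is a minimal sketch, assuming scikit-learn and hypothetical feature names and training rows, of the kind of scoring function a TabPy deployment might serve; it is not the course's exact exercise.

```python
# Minimal sketch of a churn scorer of the kind TabPy can serve.
# Feature names and training rows are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training set: [tenure_months, monthly_charges] -> churned (1) or not (0)
X_train = np.array([[2, 80.0], [48, 20.0], [5, 95.0], [60, 25.0], [12, 70.0]])
y_train = np.array([1, 0, 1, 0, 1])
model = LogisticRegression().fit(X_train, y_train)

def churn_probability(tenure_months, monthly_charges):
    """Return P(churn) for the column vectors Tableau passes in."""
    features = np.column_stack([tenure_months, monthly_charges])
    return model.predict_proba(features)[:, 1].tolist()

print(churn_probability([3, 55], [90.0, 30.0]))
```

From Tableau, a function like this is typically reached through a SCRIPT_REAL calculated field once the TabPy connection is configured.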

Data Analytics With R Training Course

Day One: Language Basics

  • Course Introduction
  • About Data Science
    • Data Science Definition
    • Process of Doing Data Science
  • Introducing R Language
  • Variables and Types
  • Control Structures (Loops / Conditionals)
  • R Scalars, Vectors, and Matrices
    • Defining R Vectors
    • Matrices
  • String and Text Manipulation
    • Character data type
    • File IO
  • Lists
  • Functions
    • Introducing Functions
    • Closures
    • lapply/sapply functions
  • DataFrames
  • Labs for all sections

Day Two: Intermediate R Programming

  • DataFrames and File I/O
  • Reading data from files
  • Data Preparation
  • Built-in Datasets
  • Visualization
    • Graphics Package
    • plot() / barplot() / hist() / boxplot() / scatter plot
    • Heat Map
    • ggplot2 package (qplot(), ggplot())
  • Exploration with dplyr
  • Labs for all sections

Day Three: Advanced Programming With R

  • Statistical Modeling With R
    • Statistical Functions
    • Dealing With NA
    • Distributions (Binomial, Poisson, Normal)
  • Regression
    • Introducing Linear Regressions
  • Recommendations
  • Text Processing (tm package / Wordclouds)
  • Clustering
    • Introduction to Clustering
    • KMeans
  • Classification
    • Introduction to Classification
    • Naive Bayes
    • Decision Trees
    • Training using the caret package
    • Evaluating Algorithms
  • R and Big Data
    • Connecting R to databases
    • Big Data Ecosystem
  • Labs for all sections

Python in Data Science Training Course

Duration

35 hours (usually 5 days including breaks)

Requirements

  • An understanding of data structures
  • Experience with programming

Audience

  • Programmers
  • Data scientists
  • Engineers

Overview

This training course helps participants prepare for web application development using Python programming with data analytics. The resulting data visualizations are a valuable decision-making tool for top management.

Course Outline

Day 1

  1. Data Science
  2. Data Science Team Composition (Data Scientist, Data Engineer, Data Visualizer, Process Owner)
  3. Business Intelligence
    1. Types of Business Intelligence
    2. Developing Business Intelligence Tools
    3. Business Intelligence and Data Visualization
  4. Data Visualization
    1. Importance of Data Visualization
    2. The Visual Data Presentation
    3. The Data Visualization Tools (infographics, dials and gauges, geographic maps, sparklines, heat maps, and detailed bar, pie and fever charts)
    4. Painting by Numbers and Playing with Colors in Making Visual Stories
  5. Activity

Day 2

  1. Data Visualization in Python Programming
    1. Data Science with Python
    2. Review on Python Fundamentals (a recap sketch follows this outline)
      1. Variables and Data Types (str, numeric, sequence, mapping, set types, Boolean, binary, casting)
      2. Operators, Lists, Tuples, Sets, Dictionaries
      3. Conditional Statements
      4. Functions, Lambda, Arrays, Classes, Objects, Inheritance, Iterators
      5. Scope, Modules, Dates, JSON, RegEx, PIP
      6. Try / Except, Command Input, String Formatting
      7. File Handling
  2. Activity
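
A compact sketch tying several of the reviewed fundamentals together (dictionaries, lambda, f-string formatting, file handling, and try/except); the file name is a hypothetical placeholder:

```python
# A compact recap of several Day 2 fundamentals: dictionaries, lambda,
# f-string formatting, file handling, and try/except.
squares = {n: n ** 2 for n in range(1, 6)}   # dictionary comprehension
double = lambda x: x * 2                     # lambda expression

def save_report(path, data):
    """Write key/value pairs to a file, one formatted line each."""
    with open(path, "w") as f:               # file handling
        for key, value in data.items():
            f.write(f"{key} -> {value}\n")   # string formatting

try:
    save_report("squares.txt", squares)
    print("Doubled 21:", double(21))
except OSError as err:                       # try / except
    print("Could not write report:", err)
```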

Day 3

  1. Python and MySQL
    1. Creating Database and Table
    2. Manipulating Database (Insert, Select, Update, Delete, Where Statement, Order by)
    3. Drop Table
    4. Limit
    5. Joining Tables
    6. Removing List Duplicates
    7. Reverse a String
  2. Data Visualization with Python and MySQL (see the sketch after this list)
    1. Using Matplotlib (Basic Plotting)
    2. Dictionaries and Pandas
    3. Logic, Control Flow and Filtering
    4. Manipulating Graph Properties (Font, Size, Color Scheme)
  3. Activity
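
As an illustration of the Day 3 flow, the sketch below queries MySQL with the mysql-connector-python package and plots the result with Matplotlib; the connection details and the sales table are hypothetical placeholders.

```python
# Sketch: query MySQL and plot the result with Matplotlib.
# Connection details and the `sales` table are hypothetical placeholders.
import mysql.connector
import matplotlib.pyplot as plt

conn = mysql.connector.connect(
    host="localhost", user="demo", password="secret", database="shop"
)
cursor = conn.cursor()
cursor.execute("SELECT month, total FROM sales ORDER BY month")
rows = cursor.fetchall()
conn.close()

months = [r[0] for r in rows]
totals = [r[1] for r in rows]

plt.bar(months, totals, color="steelblue")   # basic plotting
plt.title("Monthly Sales", fontsize=14)      # manipulating graph properties
plt.xlabel("Month")
plt.ylabel("Total")
plt.show()
```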

Day 4

  1. Plotting Data in Different Graph Format
    • Histogram
    • Line
    • Bar
    • Box Plot
    • Pie Chart
    • Donut
    • Scatter Plot
    • Radar
    • Area
    • 2D / 3D Density Plot
    • Dendrogram
    • Map (Bubble, Heat)
    • Stacked Chart
    • Venn Diagram
    • Seaborn
  2. Activity
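
A minimal sketch of four of the formats above, drawn with Matplotlib from one toy dataset:

```python
# Sketch: four of the chart formats above, drawn from one toy dataset.
import matplotlib.pyplot as plt
import numpy as np

values = np.random.default_rng(0).normal(50, 15, 200)
labels = ["A", "B", "C", "D"]
shares = [40, 30, 20, 10]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(values, bins=20)                  # histogram
axes[0, 0].set_title("Histogram")
axes[0, 1].boxplot(values)                        # box plot
axes[0, 1].set_title("Box Plot")
axes[1, 0].pie(shares, labels=labels)             # pie chart
axes[1, 0].set_title("Pie Chart")
axes[1, 1].scatter(values[:-1], values[1:], s=8)  # scatter plot
axes[1, 1].set_title("Scatter Plot")
fig.tight_layout()
plt.show()
```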

Day 5

  1. Data Visualization with Python and MySQL
    1. Group Work: Create a Top Management Data Visualization Presentation Using ITDI Local ULIMS Data
    2. Presentation of Output

Moving Data from MySQL to Hadoop with Sqoop Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • An understanding of big data concepts (HDFS, Hive, etc.)
  • An understanding of relational databases (MySQL, etc.)
  • Experience with the Linux command line

Overview

Sqoop is an open source software tool for transferring data between Hadoop and relational databases or mainframes. It can be used to import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS). Thereafter, the data can be transformed in Hadoop MapReduce and then exported back into an RDBMS.

In this instructor-led, live training, participants will learn how to use Sqoop to import data from a traditional relational database to Hadoop storage such as HDFS or Hive, and vice versa.

By the end of this training, participants will be able to:

  • Install and configure Sqoop
  • Import data from MySQL to HDFS and Hive
  • Export data from HDFS and Hive to MySQL

Audience

  • System administrators
  • Data engineers

Format of the Course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Note

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

  • Moving data from legacy data stores to Hadoop

Installing and Configuring Sqoop

Overview of Sqoop Features and Architecture

Importing Data from MySQL to HDFS
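
A minimal sketch of this step, launching a basic Sqoop import from Python; the JDBC URL, credentials, table, and HDFS path are hypothetical placeholders.

```python
# Sketch: a basic Sqoop import from MySQL into HDFS, launched from Python.
# The JDBC URL, credentials, table, and target directory are hypothetical.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbserver/shop",   # source RDBMS
    "--username", "demo", "--password", "secret",
    "--table", "orders",                         # table to import
    "--target-dir", "/user/demo/orders",         # HDFS destination
    "-m", "4",                                   # parallel map tasks
], check=True)
```

Adding --hive-import to the same command loads the result into a Hive table instead, which is the next step in this outline.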

Importing Data from MySQL to Hive

Transforming Data in Hadoop

Exporting Data from HDFS to MySQL

Exporting Data from Hive to MySQL

Importing Incrementally with Sqoop Jobs
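
A sketch of this step: saving an incremental import as a reusable Sqoop job and then executing it. The job name, check column, and connection details are hypothetical placeholders.

```python
# Sketch: save an incremental import as a reusable Sqoop job, then run it.
# The job name, check column, and connection details are hypothetical.
import subprocess

subprocess.run([
    "sqoop", "job", "--create", "orders_sync", "--", "import",
    "--connect", "jdbc:mysql://dbserver/shop",
    "--username", "demo",
    "--table", "orders",
    "--target-dir", "/user/demo/orders",
    "--incremental", "append",     # only rows newer than the last run
    "--check-column", "id",        # column used to detect new rows
    "--last-value", "0",           # starting point for the first run
], check=True)

subprocess.run(["sqoop", "job", "--exec", "orders_sync"], check=True)
```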

Troubleshooting

Summary and Conclusion

Data Mining with R Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

Good R knowledge.

Overview

R is a free, open-source programming language for statistical computing, data analysis, and graphics. It is used by a growing number of managers and data analysts in corporations and academia, and it offers a wide variety of packages for data mining.

Course Outline

Sources of methods

  • Artificial intelligence
  • Machine learning
  • Statistics
  • Sources of data

Pre-processing of data

  • Data Import/Export
  • Data Exploration and Visualization
  • Dimensionality Reduction
  • Dealing with missing values
  • R Packages

Data mining main tasks

  • Automatic or semi-automatic analysis of large quantities of data
  • Extracting previously unknown interesting patterns
    • groups of data records (cluster analysis)
    • unusual records (anomaly detection)
    • dependencies (association rule mining)

Data mining

  • Anomaly detection (Outlier/change/deviation detection)
  • Association rule learning (Dependency modeling)
  • Clustering
  • Classification
  • Regression
  • Summarization
  • Frequent Pattern Mining
  • Text Mining
  • Decision Trees
  • Neural Networks
  • Sequence Mining

Data dredging, data fishing, data snooping

SMACK Stack for Data Science Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • An understanding of data processing systems

Audience

  • Data Scientists

Overview

SMACK is a collection of data platform software, namely Apache Spark, Apache Mesos, Akka, Apache Cassandra, and Apache Kafka. Using the SMACK stack, users can create and scale data processing platforms.

This instructor-led, live training (online or onsite) is aimed at data scientists who wish to use the SMACK stack to build data processing platforms for big data solutions.

By the end of this training, participants will be able to:

  • Implement a data pipeline architecture for processing big data.
  • Develop a cluster infrastructure with Apache Mesos and Docker.
  • Analyze data with Spark and Scala.
  • Manage unstructured data with Apache Cassandra.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

SMACK Stack Overview

  • What is Apache Spark? Apache Spark features
  • What is Apache Mesos? Apache Mesos features
  • What is Akka? Akka features
  • What is Apache Cassandra? Apache Cassandra features
  • What is Apache Kafka? Apache Kafka features

Scala Language

  • Scala syntax and structure
  • Scala control flow

Preparing the Development Environment

  • Installing and configuring the SMACK stack
  • Installing and configuring Docker

Akka

  • Using actors

Apache Cassandra

  • Creating a database for read operations
  • Working with backups and recovery
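
Although the course works in Scala, the CQL involved is language-neutral; here is a minimal sketch using the DataStax Python driver against a hypothetical local node and schema.

```python
# Sketch: a keyspace and a read-optimized table via the DataStax Python
# driver. Contact point, keyspace, and schema are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
# Clustering order DESC keeps the newest reading first, optimizing reads.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id text, ts timestamp, value double,
        PRIMARY KEY (sensor_id, ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")
session.execute(
    "INSERT INTO demo.readings (sensor_id, ts, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ["s1", 21.5],
)
print(session.execute(
    "SELECT * FROM demo.readings WHERE sensor_id = %s", ["s1"]
).one())
cluster.shutdown()
```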

Connectors

  • Creating a stream
  • Building an Akka application
  • Storing data with Cassandra
  • Reviewing connectors

Apache Kafka

  • Working with clusters
  • Creating, publishing, and consuming messages
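
A minimal publish/consume round trip, sketched with the kafka-python client; the broker address and topic are hypothetical placeholders.

```python
# Sketch: publish and consume a message with kafka-python.
# Broker address and topic name are hypothetical placeholders.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user signed up")   # publish
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",            # read from the beginning
    consumer_timeout_ms=5000,                # stop once the topic is drained
)
for message in consumer:                     # consume
    print(message.topic, message.value)
```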

Apache Mesos

  • Allocating resources
  • Running clusters
  • Working with Apache Aurora and Docker
  • Running services and jobs
  • Deploying Spark, Cassandra, and Kafka on Mesos

Apache Spark

  • Managing data flows
  • Working with RDDs and DataFrames
  • Performing data analysis
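
The course uses Scala, but the same API surfaces in PySpark; a sketch of the RDD and DataFrame sides with toy data:

```python
# Sketch: the RDD and DataFrame sides of Spark, via PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smack-demo").getOrCreate()

# RDD API: functional transformations over a distributed collection
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).filter(lambda x: x > 5).collect())

# DataFrame API: declarative, columnar analysis
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)], ["name", "age"]
)
df.groupBy().avg("age").show()
spark.stop()
```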

Troubleshooting

  • Handling service failures and errors

Summary and Conclusion

Big Data – Data Science Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

Delegates should have some experience with data storage tools and an awareness of handling large data sets.

Overview

This classroom-based training session explores Big Data. Delegates will work through computer-based examples and case study exercises using relevant big data tools.

Course Outline

  1. Big data fundamentals
    • Big Data and its role in the corporate world
    • The phases of development of a Big Data strategy within a corporation
    • The rationale underlying a holistic approach to Big Data
    • Components needed in a Big Data Platform
    • Big data storage solution
    • Limits of Traditional Technologies
    • Overview of database types
    • The four dimensions of Big Data
  2. Big data impact on business
    • Business importance of Big Data
    • Challenges of extracting useful data
    • Integrating Big data with traditional data
  3. Big data storage technologies
    • Overview of big data technologies
      • Data storage models
      • Hadoop
      • Hive
      • Cassandra
      • MongoDB
    • Choosing the right big data technology
  4. Processing big data
    • Connecting to and extracting data from databases
    • Transforming and preparing data for processing
    • Using Hadoop MapReduce for processing distributed data
    • Monitoring and executing Hadoop MapReduce jobs
    • Hadoop distributed file system building blocks
    • MapReduce and YARN
    • Handling streaming data with Spark
  5. Big data analysis tools and technologies
    • Programming Hadoop with the Pig Latin language
    • Querying big data with Hive
    • Mining data with Mahout
    • Visualizing and reporting tools
  6. Big data in business
    • Managing and establishing Big Data needs
    • Business importance of Big Data
    • Selecting the right big data tools for the problem
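
To make the MapReduce material in section 4 concrete, here is the classic word count sketched in PySpark; the HDFS input path is a hypothetical placeholder.

```python
# Sketch: the classic MapReduce word count, expressed in PySpark.
# The input path is a hypothetical HDFS location.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///data/articles.txt")

counts = (
    lines.flatMap(lambda line: line.split())   # map: emit each word
         .map(lambda word: (word, 1))          # map: key by word
         .reduceByKey(lambda a, b: a + b)      # reduce: sum the ones
)
for word, n in counts.take(10):
    print(word, n)
spark.stop()
```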

Data Warehousing Concepts

  • What is a Data Warehouse?
  • Difference between OLTP and Data Warehousing
  • Data Acquisition
  • Data Extraction
  • Data Transformation
  • Data Loading
  • Data Marts
  • Dependent vs. Independent Data Marts
  • Database Design

ETL Testing Concepts

  • Introduction
  • Software development life cycle
  • Testing methodologies
  • ETL Testing Work Flow Process
  • ETL Testing Responsibilities in DataStage

NoSQL Databases

Hadoop

MapReduce

Apache Spark

Data Science for Big Data Analytics Training Course

Duration

35 hours (usually 5 days including breaks)

Overview

Big data refers to data sets that are so voluminous and complex that traditional data processing software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and information privacy.

Course Outline

Introduction to Data Science for Big Data Analytics

  • Data Science Overview
  • Big Data Overview
  • Data Structures
  • Drivers and complexities of Big Data
  • Big Data ecosystem and a new approach to analytics
  • Key technologies in Big Data
  • Data Mining process and problems
    • Association Pattern Mining
    • Data Clustering
    • Outlier Detection
    • Data Classification

Introduction to Data Analytics lifecycle

  • Discovery
  • Data preparation
  • Model planning
  • Model building
  • Presentation/Communication of results
  • Operationalization
  • Exercise: Case study

From this point on, most of the training time (around 80%) will be spent on examples and exercises in R and related big data technologies.

Getting started with R

  • Installing R and RStudio
  • Features of R language
  • Objects in R
  • Data in R
  • Data manipulation
  • Big data issues
  • Exercises

Getting started with Hadoop

  • Installing Hadoop
  • Understanding Hadoop modes
  • HDFS
  • MapReduce architecture
  • Hadoop related projects overview
  • Writing programs in Hadoop MapReduce
  • Exercises

Integrating R and Hadoop with RHadoop

  • Components of RHadoop
  • Installing RHadoop and connecting with Hadoop
  • The architecture of RHadoop
  • Hadoop streaming with R
  • Data analytics problem solving with RHadoop
  • Exercises

Pre-processing and preparing data

  • Data preparation steps
  • Feature extraction
  • Data cleaning
  • Data integration and transformation
  • Data reduction – sampling, feature subset selection
  • Dimensionality reduction
  • Discretization and binning
  • Exercises and Case study

Exploratory data analytic methods in R

  • Descriptive statistics
  • Exploratory data analysis
  • Visualization – preliminary steps
  • Visualizing a single variable
  • Examining multiple variables
  • Statistical methods for evaluation
  • Hypothesis testing
  • Exercises and Case study

Data Visualizations

  • Basic visualizations in R
  • Packages for data visualization: ggplot2, lattice, plotly
  • Formatting plots in R
  • Advanced graphs
  • Exercises

Regression (Estimating future values)

  • Linear regression
  • Use cases
  • Model description
  • Diagnostics
  • Problems with linear regression
  • Shrinkage methods, ridge regression, the lasso
  • Generalizations and nonlinearity
  • Regression splines
  • Local polynomial regression
  • Generalized additive models
  • Regression with RHadoop
  • Exercises and Case study

Classification

  • The classification related problems
  • Bayesian refresher
  • Naïve Bayes
  • Logistic regression
  • K-nearest neighbors
  • Decision trees algorithm
  • Neural networks
  • Support vector machines
  • Diagnostics of classifiers
  • Comparison of classification methods
  • Scalable classification algorithms
  • Exercises and Case study

Assessing model performance and selection

  • Bias, Variance and model complexity
  • Accuracy vs Interpretability
  • Evaluating classifiers
  • Measures of model/algorithm performance
  • Hold-out method of validation
  • Cross-validation
  • Tuning machine learning algorithms with caret package
  • Visualizing model performance with Profit, ROC, and Lift curves

Ensemble Methods

  • Bagging
  • Random Forests
  • Boosting
  • Gradient boosting
  • Exercises and Case study

Support vector machines for classification and regression

  • Maximal margin classifiers
  • Support vector classifiers
  • Support vector machines
  • SVMs for classification problems
  • SVMs for regression problems
  • Exercises and Case study

Identifying unknown groupings within a data set

  • Feature Selection for Clustering
  • Representative-based algorithms: k-means, k-medoids
  • Hierarchical algorithms: agglomerative and divisive methods
  • Probabilistic model-based algorithms: EM
  • Density-based algorithms: DBSCAN, DENCLUE
  • Cluster validation
  • Advanced clustering concepts
  • Clustering with RHadoop
  • Exercises and Case study

Discovering connections with Link Analysis

  • Link analysis concepts
  • Metrics for analyzing networks
  • The PageRank algorithm
  • Hyperlink-Induced Topic Search
  • Link Prediction
  • Exercises and Case study

Association Pattern Mining

  • Frequent Pattern Mining Model
  • Scalability issues in frequent pattern mining
  • Brute Force algorithms
  • Apriori algorithm
  • The FP-Growth approach
  • Evaluation of Candidate Rules
  • Applications of Association Rules
  • Validation and Testing
  • Diagnostics
  • Association rules with R and Hadoop
  • Exercises and Case study

Constructing recommendation engines

  • Understanding recommender systems
  • Data mining techniques used in recommender systems
  • Recommender systems with recommenderlab package
  • Evaluating the recommender systems
  • Recommendations with RHadoop
  • Exercise: Building recommendation engine

Text analysis

  • Text analysis steps
  • Collecting raw text
  • Bag of words
  • Term Frequency – Inverse Document Frequency (TF-IDF)
  • Determining Sentiments
  • Exercises and Case study

SQL For Data Science and Data Analysis Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • An understanding of databases
  • Experience with SQL is an asset

Audience

  • Business analysts
  • Software developers
  • Database developers

Overview

This instructor-led, live training (online or onsite) is aimed at software developers, managers, and business analysts who wish to use big data systems to store and retrieve large amounts of data.

By the end of this training, participants will be able to:

  • Query large amounts of data efficiently.
  • Understand how big data systems store and retrieve data.
  • Use the latest big data systems available.
  • Wrangle data from data systems into reporting systems.
  • Learn to write SQL queries in:
    • MySQL
    • Postgres
    • Hive Query Language (HiveQL/HQL)
    • Redshift

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Lesson 1 – SQL basics

  • Select statements
  • Join types
  • Indexes
  • Views
  • Subqueries
  • Union
  • Creating tables
  • Loading data
  • Dumping data
  • NoSQL
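
A sketch of several Lesson 1 constructs against an in-memory SQLite database (chosen here only for self-containment; the course itself targets MySQL, Postgres, HiveQL, and Redshift). The two-table schema is hypothetical.

```python
# Sketch: CREATE TABLE, loading data, a join, and a subquery,
# run against an in-memory SQLite database with a hypothetical schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 45.0), (3, 2, 12.5);
""")

# Inner join plus a subquery: customers whose spend exceeds the average order
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS spend
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    HAVING spend > (SELECT AVG(amount) FROM orders)
""").fetchall()
print(rows)
```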

Lesson 2 – Data Modeling

  • Transaction-based ER systems
  • Data warehousing
  • Data warehouse models
    • Star schema
    • Snowflake schemas
  • Slowly changing dimensions (SCD)
  • Structured and non-structured data
  • Different table type storage engines:
    • Column based
    • Document-based
    • In Memory
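
A minimal star schema sketch: one fact table keyed to two dimension tables. Table and column names are hypothetical, and SQLite again stands in for a warehouse engine.

```python
# Sketch: a minimal star schema — one fact table, two dimensions.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INT, month INT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT,
                              category TEXT);
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        units       INTEGER,
        revenue    REAL
    );
""")
```

A snowflake variant would further normalize dim_product, for example splitting category into its own table.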

Lesson 3 – Index in the NoSQL/Data science world

  • Constraints (Primary)
  • Index-based scanning
  • Performance tuning

Lesson 4 – NoSQL and non-structured data

  • When to use NoSQL
  • Eventually consistent data
  • Schema on read vs. Schema on write

Lesson 5 – SQL for data analytics

  • Window functions
  • Lateral Joins
  • Lead & Lag
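
A sketch of LAG and a running total, the window-function patterns above, shown on SQLite 3.25+ (the syntax is shared with MySQL 8 and Postgres); the table is hypothetical.

```python
# Sketch: LAG and a running total via window functions, on SQLite 3.25+.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (day TEXT, amount REAL);
    INSERT INTO daily_sales VALUES
        ('2024-01-01', 100), ('2024-01-02', 80), ('2024-01-03', 120);
""")
for row in conn.execute("""
    SELECT day,
           amount,
           LAG(amount) OVER (ORDER BY day) AS prev_amount,
           SUM(amount) OVER (ORDER BY day) AS running_total
    FROM daily_sales
"""):
    print(row)
```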

Lesson 6 – HiveQL

  • SQL Support
  • External and Internal Tables
  • Joins
  • Partitions
  • Correlated subqueries
  • Nested queries
  • When to use Hive
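
A sketch of an external, partitioned table in HiveQL, submitted here through the PyHive client (an assumed choice; any Hive client behaves the same). The table, HDFS location, and partition value are hypothetical.

```python
# Sketch: an external, partitioned table in HiveQL, via the PyHive client.
# Host, table, HDFS location, and partition value are hypothetical.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# External table: Hive manages only the metadata; dropping the table
# leaves the underlying HDFS files in place.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        ts STRING, url STRING, status INT
    )
    PARTITIONED BY (log_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/web_logs'
""")
# Partitions prune whole directories at query time.
cursor.execute(
    "ALTER TABLE web_logs ADD IF NOT EXISTS PARTITION (log_date='2024-01-01')"
)
```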

Lesson 7 – Redshift

  • Design and structure
  • Locks and shared resources
  • Differences from Postgres
  • When to use Redshift