Data Cleaning Training Course

Introduction

Overview of Data Cleaning

  • Why is Data Cleaning Important?

Case Study: When Big Data Is Dirty

Developing A Thorough Data Cleaning Strategy

Common Data Cleaning Tools

  • Drake
  • OpenRefine
  • Pandas (for Python)
  • Dplyr (for R)

Achieving High Data Integrity

  • Complete
  • Correct
  • Accurate
  • Relevant
  • Consistent

Automating the Data Cleaning Process

Monitoring Your Data Cleaning System

Summary and Conclusion

Tableau Prep Builder Training Course

Introduction

  • Tableau Prep Builder as an ETL (Extract, Transform and Load) tool

Setting up Tableau Prep Builder

  • Activating and Registering Tableau Prep Builder

Overview of Tableau Prep Builder Features and Architecture

  • Tableau Prep Builder and its relation to Tableau Desktop, Tableau Server and Tableau Prep Conductor

Navigating the Tableau Prep Builder Workspace 

  • Panes and data grids

Connecting to a Data Source

  • Reading an Excel file
  • Reading from a database

Shaping Data

  • Creating a Join
  • Creating a Union

Pivoting Data

  • Changing columns to rows

Cleaning Data

  • What is dirty data?
  • Changing the data type
  • Filtering data
  • Aggregating data
  • Updating multiple values at once

Running a Flow

  • Building a flow
  • Refreshing a flow
  • Running a flow from the command line

Outputting Data to Tableau Desktop

  • Analyzing data

Publishing Data

  • Packaging a Prep Builder data flow into an extract
  • Publishing a flow with Prep Conductor 

Best Practices 

Troubleshooting

Learn Data Cleaning with Python

Perform Data Cleaning Techniques with the Python Programming Language. Practice and Solution Notebooks included.

Requirements

  • You will need basic Python programming proficiency.
  • You will need a modern browser, e.g. Google Chrome or Mozilla Firefox.

Description

By the end of this course, you will be able to:

  • Standardize a dataset by fixing inconsistent column names.
  • Perform data type conversion to fix inaccurate data.
  • Find and fix syntax errors in a dataset.
  • Find and fix typos in a dataset.
  • Deal with irrelevant data in a dataset.
  • Remove any duplicate records in a dataset.
  • Find and deal with missing data in a dataset.
  • Find and deal with outliers in a dataset.
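
As a rough illustration of the techniques listed above, here is a minimal sketch using only the Python standard library. The records, the column names, and the choice to replace missing or outlying ages with the median are all assumptions made for this example and are not taken from the course material. In practice a library such as pandas would typically handle these steps.

```python
import statistics

# Hypothetical raw records; column names and values are invented for illustration.
RAW_ROWS = [
    {"Customer ID ": "1", " Age": "34", "City": "new york"},
    {"Customer ID ": "2", " Age": "n/a", "City": "New York"},
    {"Customer ID ": "2", " Age": "n/a", "City": "New York"},  # duplicate record
    {"Customer ID ": "3", " Age": "29", "City": "Boston "},
    {"Customer ID ": "4", " Age": "31", "City": "boston"},
    {"Customer ID ": "5", " Age": "290", "City": "Boston"},  # suspicious value
    {"Customer ID ": "6", " Age": "27", "City": "Chicago"},
    {"Customer ID ": "7", " Age": "33", "City": "chicago"},
    {"Customer ID ": "8", " Age": "30", "City": "Chicago"},
    {"Customer ID ": "9", " Age": "35", "City": "New York"},
]


def standardize_name(name):
    """Turn ' Customer ID ' into 'customer_id'."""
    return name.strip().lower().replace(" ", "_")


def to_int_or_none(text):
    """Convert text to int; return None for values that are not numbers."""
    try:
        return int(text)
    except ValueError:
        return None


def iqr_outliers(values):
    """Return values more than 1.5 * IQR outside the quartiles (a rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]


def clean(raw_rows):
    rows, seen = [], set()
    for raw in raw_rows:
        row = {standardize_name(k): v.strip() for k, v in raw.items()}
        row["customer_id"] = int(row["customer_id"])  # type conversion
        row["age"] = to_int_or_none(row["age"])       # "n/a" becomes missing
        row["city"] = row["city"].title()             # fix inconsistent casing
        key = (row["customer_id"], row["age"], row["city"])
        if key not in seen:                           # drop exact duplicates
            seen.add(key)
            rows.append(row)

    known = [r["age"] for r in rows if r["age"] is not None]
    outliers = set(iqr_outliers(known))
    typical = statistics.median(a for a in known if a not in outliers)
    for row in rows:
        if row["age"] is None or row["age"] in outliers:
            row["age"] = typical                      # impute with the median
    return rows


cleaned = clean(RAW_ROWS)
```

Whether an outlier should be replaced, removed, or kept depends on the dataset; replacing it with the median is only one option.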

Who this course is for:

  • This course is designed for professionals with an interest in getting hands-on experience with the respective data science techniques and tools.

Course content

7 sections • 7 lectures • 50m total length

How to Develop Machine Learning Applications for Business

Today, many businesses rely on machine learning (ML) applications to understand revenue opportunities, identify market trends, predict customer behavior and pricing fluctuations, and make the right business decisions. Developing these applications requires diligent planning and a defined sequence of steps. Problem framing, data cleaning, feature engineering, model training, and improving model accuracy are among the steps involved.

Machine learning, a subset of artificial intelligence, helps make sense of historical data and supports decision making. It is a set of techniques for finding patterns in data and building mathematical models around those findings.

Once we build and train a machine learning algorithm to form a mathematical representation of the data, we can use that model to make predictions about new data. For example, in retail, a model trained on historical purchase data can predict whether a user will buy a particular product.

Types of machine learning algorithms

Machine learning algorithms can be divided into three categories:

  1. Supervised machine learning
  2. Unsupervised machine learning
  3. Reinforcement learning

In business, we mostly use supervised machine learning algorithms for tasks such as classification (binary and multiclass), activity monitoring, predicting a numerical value, and more. We also use unsupervised machine learning techniques for applications such as grouping or clustering, dimensionality reduction, and anomaly detection.

While both of these approaches have many practical uses in business, reinforcement learning (RL) currently has more limited business applications, such as path optimization in the transit industry. However, RL is the subject of extensive research and is gradually gaining ground alongside supervised and unsupervised learning. It is a powerful technique with considerable potential for businesses.

A case in point

Why is reinforcement learning so powerful?

Here is the story of AlphaGo and AlphaGo Zero.

Go is one of the world’s oldest board games. It is so complex that the number of possible board positions exceeds the number of atoms in the observable universe.

DeepMind built AlphaGo on reinforcement learning algorithms; it learned by analyzing games played by humans and by playing against itself. In October 2015, it beat the professional player Fan Hui 5-0.

In March 2016, AlphaGo was set to take on the Go champion Lee Sedol. Most Go experts expected Lee Sedol to beat AlphaGo easily, perhaps even 5-0.

DeepMind invited Fan Hui back to assess how strong a player AlphaGo had become through training with reinforcement learning algorithms and how much it had improved. During this assessment, Fan Hui found a major weakness in AlphaGo, but there was no time to correct it.

To everyone’s surprise, AlphaGo won the match 4-1. Lee found a clue to AlphaGo’s weakness and won the fourth game. However, AlphaGo went on to win the fifth game against Lee despite that weakness.

AlphaGo was taught Go using records of human games. The next version, AlphaGo Zero, learned the game purely by playing against itself, given only the basic rules. After just three days of training, it surpassed the version of AlphaGo that had beaten world champion Lee Sedol.

Although this was achieved with reinforcement learning, under the hood these systems used deep convolutional neural networks (CNNs) to process the board position, much as they would an image. CNNs are a type of deep learning model widely used in business applications.

When to use machine learning

Machine learning is a powerful tool, but it should not be the default choice, because it is computationally expensive and its models need to be trained and updated on a regular basis. It is sometimes better to rely on conventional software than on machine learning.

For certain use cases, we can build a robust solution without machine learning, relying instead on rules, simple calculations, or predetermined processes for results and decision-making. These are easy to program and do not need any exhaustive learning. Hence, experts suggest reserving machine learning for specific cases and scenarios.

There are two scenarios where we can use machine learning solutions:

  1. Inability to code the rules:
       • Tasks that cannot be done by deploying a set of rules
       • Rules that are difficult to identify and implement
       • Multiple rules that must work hand in hand and are difficult to code
       • Other factors that make rules based on them difficult to code
       • Overlapping rules that lead to inaccurate code
  2. Data scale is high:
       • You can define rules from a few samples, but it is difficult to scan millions of records for a better prediction.

Machine learning can be used in both of the above scenarios, as it produces a mathematical model that captures the rules and can solve large-scale problems.

Steps for developing machine learning applications

Building a machine learning application is an iterative process that follows a sequence of steps. The steps involved in developing machine learning applications are described below:

Problem framing

The first step is to frame the machine learning problem in terms of what we want to predict and what kind of observation data we have to make those predictions. A prediction is generally a label or a target answer: it may be a yes/no label (binary classification), a category (multiclass classification), or a real number (regression).

Collect and clean the data

Once we frame the problem and identify what kind of historical data we have for prediction modeling, the next step is to collect the data from a historical database or from open datasets or from any other data sources.

Not all of the collected data is useful for a machine learning application. We may need to remove irrelevant data, which can hurt prediction accuracy or add computation without improving the result.

Prepare data for ML application

Once the data is ready for the machine learning algorithm, we need to transform it into a form that the ML system can understand. Machines cannot work directly with raw images or text; we need to convert them into numbers. This step may also require building a data pipeline, depending on the needs of the machine learning application.
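
To make "converting to numbers" concrete, here is a minimal sketch of two common encodings: one-hot encoding for categories and a bag-of-words count for text. The function names and sample values are illustrative assumptions. Real pipelines usually rely on a library for this step.

```python
def one_hot(values):
    """Encode each category as a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        vectors.append(vector)
    return categories, vectors


def bag_of_words(texts):
    """Encode each text as word counts over a shared, sorted vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({word for words in tokenized for word in words})
    vectors = [[words.count(word) for word in vocabulary] for words in tokenized]
    return vocabulary, vectors


# Hypothetical inputs, invented for this example.
colors, color_vectors = one_hot(["red", "green", "red"])
vocab, text_vectors = bag_of_words(["good product", "bad product bad"])
```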

Feature engineering

Sometimes raw data does not reveal all the facts about the target label. Feature engineering is a technique for creating additional, more relevant features, for example by combining two or more existing features with an arithmetic operation.

For example, in a compute engine it is common for both RAM and CPU usage to reach 95%, but something is wrong when RAM usage is at 5% and CPU usage is at 93%. We can use the ratio of RAM to CPU usage as a new feature, which may provide a better prediction. Deep learning models can often learn such features themselves, which reduces the need for explicit feature engineering.
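
The ratio feature described above could be computed along these lines. This is a sketch only: the field names `ram_pct` and `cpu_pct` and the small `eps` guard against division by zero are assumptions introduced for the example.

```python
def add_ram_cpu_ratio(samples, eps=1e-9):
    """Return copies of the samples with an extra 'ram_cpu_ratio' feature."""
    enriched = []
    for sample in samples:
        row = dict(sample)  # copy so the input is left untouched
        row["ram_cpu_ratio"] = sample["ram_pct"] / max(sample["cpu_pct"], eps)
        enriched.append(row)
    return enriched


# Hypothetical usage readings: one balanced machine, one suspicious one.
readings = [
    {"ram_pct": 95.0, "cpu_pct": 95.0},
    {"ram_pct": 5.0, "cpu_pct": 93.0},
]
features = add_ram_cpu_ratio(readings)
```

The balanced machine gets a ratio of 1.0, while the suspicious one gets a ratio close to zero, so a single number now separates the two cases.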

Training a model

Before we train the model, we need to split the data into training and evaluation sets, because we need to monitor how well the model generalizes to unseen data. The algorithm then learns the patterns and the mapping between the features and the label.
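
A minimal sketch of such a split is shown below. The 80/20 proportion and the fixed seed are assumptions chosen for illustration. The right proportion depends on how much data is available.

```python
import random


def train_eval_split(rows, eval_fraction=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them for evaluation."""
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


rows = list(range(10))  # stand-in for real examples
train_rows, eval_rows = train_eval_split(rows)
```

Shuffling before splitting avoids holding out, say, only the most recent records, although for time-ordered data a chronological split may be the better choice.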

The learned mapping can be linear or non-linear, depending on the activation function and the algorithm. Several hyperparameters affect both learning and training time, such as the learning rate, regularization, batch size, number of passes (epochs), and optimization algorithm.
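
To show where these hyperparameters enter the training loop, here is a small sketch that fits a straight line with mini-batch gradient descent. The model, the toy data, and the default values are assumptions picked for illustration, not recommended settings.

```python
import random


def train_linear(xs, ys, learning_rate=0.05, epochs=500, batch_size=2,
                 l2=0.0, seed=0):
    """Fit y ~ w * x + b by mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w, b = 0.0, 0.0
    for _ in range(epochs):                      # number of passes over the data
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            errors = [(w * x + b - y, x) for x, y in batch]
            # Gradients of the mean squared error, plus an L2 penalty on w.
            grad_w = sum(2 * e * x for e, x in errors) / len(batch) + 2 * l2 * w
            grad_b = sum(2 * e for e, _ in errors) / len(batch)
            w -= learning_rate * grad_w          # step size set by learning rate
            b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
w, b = train_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

A larger learning rate takes bigger steps and can diverge, while fewer epochs or a very small learning rate may stop before the fit is good.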

Evaluating and improving model accuracy

Accuracy is a measure of how well a model performs on an unseen validation set. Based on what it has learned so far, we evaluate how the model performs on that set. Depending on the application, we can use different metrics. For example, for classification we may use precision and recall or the F1 score; for object detection, we may use IoU (intersection over union).
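
The metrics mentioned above can be computed as in the following sketch. The label encoding (1 for the positive class) and the box format `(x_min, y_min, x_max, y_max)` are assumptions made for the example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union else 0.0


# Two true positives, one false positive, one false negative.
precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```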

If a model is not doing well, the problem usually falls into one of two classes: 1) over-fitting or 2) under-fitting.

When a model does well on the training data but not on the validation data, it is over-fitting: the model is not generalizing well. Solutions include regularizing the algorithm, reducing the number of input features, eliminating redundant features, and using resampling techniques like k-fold cross-validation.
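
As an illustration of k-fold cross-validation, the sketch below only generates the index splits; training and scoring a model on each split is left out. The function name is hypothetical, and in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
```

Every sample is used for validation exactly once, so averaging the validation score across folds gives a steadier estimate of generalization than a single split.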

In the under-fitting scenario, a model performs poorly on both the training and validation datasets. Solutions may include training with more data, evaluating different algorithms or architectures, running more passes, and experimenting with the learning rate or optimization algorithm.
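
The two failure modes can be told apart by comparing training and validation scores, as in this rough sketch. The thresholds `target` and `max_gap` are arbitrary assumptions for illustration. Sensible values depend on the problem and the metric.

```python
def diagnose_fit(train_score, val_score, target=0.9, max_gap=0.05):
    """Classify a model's fit from its training and validation scores."""
    if train_score < target:
        # Poor even on the data it was trained on.
        return "under-fitting"
    if train_score - val_score > max_gap:
        # Good on training data but much worse on validation data.
        return "over-fitting"
    return "ok"


# Hypothetical scores (e.g. accuracy between 0 and 1).
results = [
    diagnose_fit(0.99, 0.70),
    diagnose_fit(0.60, 0.58),
    diagnose_fit(0.93, 0.91),
]
```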

After iterative training, the algorithm learns a model that maps input data to those labels, and this model can be used to make predictions on unseen data.

Serving with a model in production

Once the trained model does well on unseen data, it can be used for prediction. This is the most important phase for businesses, and also one of the most difficult for business-oriented machine learning applications. In this phase, we deploy the model in production to make predictions on real-world data.

Wrapping up

Machine learning is an enabling technology, but without a proper plan and disciplined execution for training models, a project may fail. Hence, businesses that want to build complex machine learning systems should consider hiring AI and machine learning service providers so that they can focus on their core competency.

eInfochips provides Artificial Intelligence & Machine Learning offerings to help organizations build highly customized solutions running on advanced machine learning algorithms. We help companies integrate these algorithms with image & video analytics, as well as with emerging technologies such as augmented reality & virtual reality, to deliver utmost customer satisfaction and gain a competitive edge over others. Learn more about our machine learning expertise.