Machine Learning through Case Studies for Beginners

Get a comprehensive understanding of what Machine Learning is

Experience Machine Learning in practice through real world case studies

Get a broad overview of different Machine Learning models


  • None


Are you a beginner looking to get started with Machine Learning? This course offers a gentle introduction to Machine Learning through real world case studies as you invent your first Machine Learning algorithm.

This course gives you a broad overview of the variety of Machine Learning models and provides a learning ladder to continue learning. It also presents applications that use Machine Learning and details a plethora of techniques that are used to evolve Machine Learning models from data.

This course presents data for a simple case study of classifying emails automatically. It provides the data set, identifies features and labels, and presents the intuition behind any Machine Learning algorithm. The course goes on to talk about both (i) Supervised and (ii) Unsupervised learning models. It presents an analysis of overfitting and underfitting in models.

The course aims to motivate a beginner to get started with their Machine Learning journey. This course will be further supplemented with focused sessions on various regression, classification and clustering algorithms.

The subsequent sessions will get into the math behind the algorithm while solving a real world case study. Students who continue this course through the recommended ladder will eventually have the skills to build and deploy Machine Learning models to production.

Who this course is for:

  • Any software developer wanting to begin their journey in Machine Learning and Artificial Intelligence

Course content

Machine Learning with Python

Supervised learning

Unsupervised learning

Regression learning



  • Install NumPy, Matplotlib, and pandas


Machine Learning tutorial provides basic and advanced concepts of machine learning. Our machine learning tutorial is designed for students and working professionals.

Machine learning is a growing technology which enables computers to learn automatically from past data. Machine learning uses various algorithms for building mathematical models and making predictions using historical data or information. Currently, it is being used for various tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.

This machine learning tutorial gives you an introduction to machine learning along with the wide range of machine learning techniques such as Supervised, Unsupervised, and Reinforcement learning. You will learn about regression and classification models, clustering methods, hidden Markov models, and various sequential models.

When you tag a face in a Facebook photo, it is AI that is running behind the scenes and identifying faces in a picture. Face tagging is now omnipresent in several applications that display pictures with human faces. Why just human faces? There are several applications that detect objects such as cats, dogs, bottles, cars, etc. We have autonomous cars running on our roads that detect objects in real time to steer the car. When you travel, you use Google Directions to learn the real-time traffic situation and follow the best path suggested by Google at that point of time. This is yet another application of AI techniques working in real time.

Let us consider the example of Google Translate application that we typically use while visiting foreign countries. Google’s online translator app on your mobile helps you communicate with the local people speaking a language that is foreign to you.

There are several applications of AI that we use in practice today. In fact, each one of us uses AI in many parts of our lives, even without our knowledge. Today’s AI can perform extremely complex jobs with great accuracy and speed. Let us discuss an example of a complex task to understand what capabilities are expected in an AI application that you would be developing today for your clients.


We all use Google Directions during our trips anywhere in the city, for a daily commute or even for inter-city travel. The Google Directions application suggests the fastest path to our destination at that moment. When we follow this path, we have observed that Google is almost 100% right in its suggestions and we save our valuable time on the trip.

You can imagine the complexity involved in developing this kind of application considering that there are multiple paths to your destination and the application has to judge the traffic situation on every possible path to give you a travel time estimate for each such path. Besides, consider the fact that Google Directions covers the entire globe. Undoubtedly, lots of AI and Machine Learning techniques are in use under the hood of such applications.

Considering the continuous demand for the development of such applications, you will now appreciate why there is a sudden demand for IT professionals with AI skills.

Who this course is for:

  • Python developers curious about Data Science
  • Machine learners
  • Computer Science Engineers

Course content

The Top 5 Machine Learning Libraries in Python

You’ll receive the completely annotated Jupyter Notebook used in the course.

You’ll be able to define and give examples of the top libraries in Python used to build real world predictive models.

You will be able to create models with the most powerful language for machine learning there is.

You’ll understand the supervised predictive modeling process and learn the core vernacular at a high level.


  • There are no prerequisites; however, knowledge of Python will be helpful.
  • Familiarity with the concepts of machine learning would be helpful but isn’t necessary.


Recent Review from Similar Course:

“This was one of the most useful classes I have taken in a long time. Very specific, real-world examples. It covered several instances of ‘what is happening’, ‘what it means’ and ‘how you fix it’. I was impressed.”  Steve

Welcome to The Top 5 Machine Learning Libraries in Python.  This is an introductory course on the process of building supervised machine learning models and then using libraries in a computer programming language called Python.

What’s the top career in the world? Doctor? Lawyer? Teacher? Nope. None of those.

The top career in the world is the data scientist. Great. What’s a data scientist?

The area of study which involves extracting knowledge from data is called Data Science, and people practicing in this field are called Data Scientists.

Businesses generate a huge amount of data. The data has tremendous value, but there is so much of it. Where do you begin to look for value that is actionable? That’s where the data scientist comes in. The job of the data scientist is to create predictive models that can find hidden patterns in data that will give the business a competitive advantage in their space.

Don’t I need a PhD?  Nope. Some data scientists do have PhDs but it’s not a requirement.  A similar career to that of the data scientist is the machine learning engineer.

A machine learning engineer is a person who builds predictive models, scores them and then puts them into production so that others in the company can consume or use their model. They are usually skilled programmers who have a solid background in data mining or other data-related professions and who have learned predictive modeling.

In the course we are going to take a look at what machine learning engineers do. We are going to learn about the process of building supervised predictive models and build several using the most widely used programming language for machine learning. Python. There are literally hundreds of libraries we can import into Python that are machine learning related.

A library is simply a collection of code that lives outside the core language. We “import” it into our workspace when we need to use its functionality. We can mix and match these libraries like Lego blocks.
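As a small illustration of what importing a library looks like, the sketch below uses Python’s built-in `statistics` module. The module and the sample numbers are chosen here only as an example; the course description does not name them.

```python
# Import a library (here the standard-library `statistics` module) so that
# its functions become available in our workspace.
import statistics

scores = [70, 80, 90]                 # made-up sample values
average = statistics.mean(scores)     # functionality provided by the library
print(average)
```

Third-party machine learning libraries are brought in with the same `import` statement once they have been installed.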

Thanks for your interest in The Top 5 Machine Learning Libraries in Python and we will see you in the course.

Who this course is for:

  • If you’re looking to learn machine learning then this course is for you.

Course content

What is Machine Learning?

Overview of Supervised, Unsupervised, and Reinforcement Learning


  • Interest in machine learning


Course Outcome:

Learners completing this course will be able to give definitions and explain the types of problems that can be solved by the 3 broad areas of machine learning: Supervised, Unsupervised, and Reinforcement Learning.

Course Topics and Approach:

This course gives a gentle introduction to the 3 broad areas of machine learning: Supervised, Unsupervised, and Reinforcement Learning. The goal is to explain the key ideas using examples with many plots and animations and little math, so that the material can be accessed by a wide range of learners. The lectures are supplemented by Python demos, which show machine learning in action. Learners are encouraged to experiment with the course demo codes. Additionally, information about machine learning resources is provided, including sources of data and publicly available software packages.

Course Audience:

This course has been designed for ALL LEARNERS!!!

  • Course does not go into detail into the underlying math, so no specific math background is required
  • No previous experience with machine learning is required
  • No previous experience with Python (or programming in general) is required to be able to experiment with the course demo codes

Teaching Style and Resources:

  • Course includes many examples with plots and animations used to help students get a better understanding of the material
  • All resources, including course codes, Powerpoint presentations, info on additional resources, can be downloaded from the course Github site

Python Demos:

There are several options for running the Python demos:

  • Run online using Google Colab (With this option, demo codes can be run completely online, so no downloads are required. A Google account is required.)
  • Run on local machine using the Anaconda platform (This is probably best approach for those who would like to run codes locally, but don’t have python on their local machine. Demo video shows where to get free community version of Anaconda platform and how to run the codes.)
  • Run on local machine using python (This approach may be most suitable for those who already have python on their machines)

2021.09.28 Update

  • Section 5: update course codes, Powerpoint presentations, and videos so that codes are compatible with more recent versions of the Anaconda platform and plotting package

Who this course is for:

  • People curious about machine learning and data science

Course content

Introduction to Data Science for Complete Beginners

What is Data Science

Who is a Data Scientist

Type of Questions that a Data Science Can Answer

Supervised and Unsupervised Learning in Machine Learning with Real life Examples

Applications of Data Science in Real Life

What is Data Engineering

Who is a Data Engineer

What is Machine Learning

Who is a Machine Learning Engineer

Skills Needed to become a Data scientist

How to Practice Data Science and Build your portfolio

Certifications in Data Science

Some Great Books in Data Science


  • Laptop or PC
  • A Good Connection to the internet
  • Passion to Learn about Data Science


Data science and machine learning are among the hottest fields in the market and have a bright future.

In the past ten years, many courses have appeared that explain the field in a more practical way than in theory.

During my experience in counseling and mentoring, I faced many obstacles, the most important of which was the existence of educational gaps for the learner, and most of the gaps were in the theoretical field.

To fill this gap, I made this course. Thank God, this course has helped many students to properly understand the field of data science.

If you have no idea what the field of data science is and are looking for a very quick introduction to data science, this course will help you become familiar with and understand some of the main concepts underlying data science.

If you are an expert in the field of data science, then attending this course will give you a general overview of the field.

This short course will lay a strong foundation for understanding the most important concepts taught in advanced data science courses. It is well suited to you if you do not have any idea about the field of data science and want to start learning data science from scratch.

Who this course is for:

  • Data Science Enthusiasts
  • People who want to Become Data Scientists
  • Data Science Aspirants

Course content

Theoretical concepts of Machine Learning

Students will learn about the types of machine learning addressed in Python’s library, sklearn.

Students will learn about supervised learning.

Students will learn about semi-supervised learning.

Students will learn about unsupervised learning.

Students will learn about sklearn’s models used in supervised learning, semi-supervised learning and unsupervised learning.

Students will learn about dimensionality reduction and sklearn functions that address this.

Students will learn about feature selection and sklearn functions that address this.

Students will learn about data preprocessing and sklearn functions that address this.

Students will learn about hyperparameter tuning and sklearn functions that address this.

Students will learn about goodness of fit tests and sklearn functions that address this.


  • No programming experience is needed, but it would be helpful to know basic Python programming.


This course covers over 27 functions in Python’s machine learning library, sklearn. The functions covered in this course take the student through the entire machine learning life cycle.

The student will learn the types of learning that are part of sklearn, to include supervised, semi-supervised and unsupervised learning.

The student will learn about the types of estimators used in supervised, semi-supervised and unsupervised learning, to include classification and regression.

The student will learn about a variety of supervised learning estimators to include linear regression, logistic regression, decision tree, random forest, naive Bayes, support vector machine, k nearest neighbour, and neural network.

The student will learn about sklearn’s three semi-supervised functions to make predictions on classification problems.

The student will learn about some of the estimators used to make predictions in unsupervised learning, to include k-means, hierarchical and Gaussian methods.

The student will learn about dimensionality reduction and feature selection as a means of reducing the number of features in the dataset.

The student will learn about the different functions in sklearn that carry out preprocessing activities to include standardisation, normalisation, encoding and imputation.

The student will learn about hyperparameter tuning and how to perform a grid search on the different parameters in the model to help it work at peak optimisation.

The student will learn about goodness of fit tests, to include root mean squared error, accuracy score, confusion matrix, and classification report, which tell the user how well the model has performed.
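To make these measures concrete, here is a minimal plain-Python sketch of root mean squared error, accuracy score and a confusion matrix. The function names and sample values are made up for illustration; sklearn ships its own versions (for example `accuracy_score` and `confusion_matrix` in `sklearn.metrics`), which the course presumably uses instead.

```python
import math
from collections import Counter

def rmse(y_true, y_pred):
    """Root mean squared error for numeric predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Regression example (hypothetical values)
actual_values = [3.0, 5.0, 7.0]
predicted_values = [2.0, 5.0, 9.0]
error = rmse(actual_values, predicted_values)

# Classification example (hypothetical labels)
actual = ["spam", "ham", "spam", "ham"]
predicted = ["spam", "spam", "spam", "ham"]
acc = accuracy(actual, predicted)
matrix = confusion_counts(actual, predicted)
print(error, acc, dict(matrix))
```

A lower RMSE and a higher accuracy both indicate a better fit, while the confusion counts show which kinds of mistakes the model makes.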

The student will receive additional learning and cover the machine learning life cycle, enabling them to initiate their own machine learning project using sklearn.

Who this course is for:

  • Beginner Python developers who would like to know how to undertake machine learning using Python’s sklearn library.

Course content

Machine Learning Book Classification

How to use Python Pandas for loading dataset

Creating the model in Supervised machine learning

Use pickle to dump the model and vectorizer in the disk

Deploy machine learning model on Django


  • Python, Django And Machine Learning Basics


Become an Artificial Intelligence Engineer.

This is a step-by-step course on how to build a book classifier using machine learning. It covers NumPy, Pandas, Matplotlib, scikit-learn, and Django, and at the end the predictive model is deployed on Django. What most machine learning beginners do not know is how to deploy a model they have created. How do you put a trained model into an application? Training a model and getting 80%, 85%, or 90% accuracy is not enough on its own. As an Artificial Intelligence Engineer, you should be able to put the model you created into an application.

In fact, learning how to deploy a Machine Learning model is a big win for you and a strong motivation to keep improving, embracing, and learning machine learning. It annoys me when I hear people say that Artificial Intelligence is not real and that it is just a theoretical study. Let’s learn together how to deploy models, solve people’s problems and change people’s minds about Artificial Intelligence.

At the end of this course, you will become an Artificial Intelligence Engineer through your ability to put the models you create into an application and solve people’s problems. You will also be exposed to a few concepts of Django, a Python web framework that is currently popular. By understanding Django, you will be able to deploy models that you previously could not.

Who this course is for:

  • Python Developers interested in machine learning

Course content

Self-driving go-kart with Unity-ML

Configure and use the Unity Machine Learning Agents toolkit to solve physical problems in simulated environments

Understand the concepts of neural networks, supervised and deep reinforcement learning (PPO)

Apply ML control techniques to teach a go-kart to drive around a track in Unity


  • Basic algebra and basic programming skills


WARNING: take this class as a gentle introduction to machine learning, with particular focus on machine vision and reinforcement learning. The Unity project provided in this course is now obsolete because the Unity ML agents library is still in its beta version and the interface keeps changing all the time! Some of the implementation details you will find in this course will look different if you are using the latest release, but the key concepts and the background theory are still valid. Please refer to the official migrating documentation on the ml-agents github for the latest updates.

Learn how to combine the beauty of Unity with the power of Tensorflow to solve physical problems in a simulated environment with state-of-the-art machine learning techniques.

We study the problem of a go-kart racing around a simple track and try three different approaches to control it: a simple PID controller; a neural network trained via imitation (supervised) learning; and a neural network trained via deep reinforcement learning.

Each technique has its strengths and weaknesses, which we first show in a theoretical way at simple conceptual level, and then apply in a practical way. In all three cases the go-kart will be able to complete a lap without crashing.

We provide the Unity template and the files for all three solutions. Then see if you can build on them and improve performance further.

Buckle up and have fun! 

Who this course is for:

  • Students interested in a quick jump into machine learning, focusing on the application rather than the theory
  • Engineers looking for a machine learning realistic simulator

Course content

Artificial Intelligence for Accountants I

What is Artificial Intelligence (AI)

Role of AI in accounting and finance

Various stages of AI

How to use Python to work with AI


  • Python, business and accounting understanding and keen to learn attitude.


Artificial Intelligence for Accountants I :

Are you ready to stay ahead of the game and tackle disruption head-on? As a finance professional, you know that AI is the future, but do you know how to use it to your advantage? Our course, Artificial Intelligence for Accountants I, will give you the tools and knowledge you need to succeed in a rapidly evolving landscape.

Every leader, manager and finance professional now understands the importance of dealing with disruption.

According to the 2018 EY Global Financial Accounting and Advisory Services (FAAS), corporate reporting survey, close to three-quarters (72%) of finance leaders worldwide believed that AI would have a significant impact on how finance drives data-driven insight. However, businesses that dive into the implementation of AI technologies without understanding the associated challenges face substantial risks.

The question is whether an ordinary accountant understands what AI is. And why should accountants working in various business domains such as financial reporting, financial analysis, compliance, internal and external audit, finance, investments, etc., even worry about Artificial Intelligence?

This is an introductory-level multi-series course on Artificial Intelligence. It develops a basic understanding and explains:

– What is intelligence

– What is AI

– Why AI

– A high-level overview of AI applications in accounting and finance

– How you can interact with AI

– Types of AI

– Introduction to the most popular form of AI, i.e. Machine Learning

– Introduction to Supervised Machine Learning

– Introduction to the popular Python-based library scikit-learn

This series aims to develop next-generation accountants that understand the most complicated technology humans have ever invented.


To get maximum benefit, you should have a basic level of Python knowledge. However, you can still take this course to gain familiarity with AI.

Who this course is for:

  • Accountants, Business Manager, accounting and finance students, auditors, analysts, data analysts

Course content

Types of machine learning algorithms

Regardless of whether the learner is a human or machine, the basic learning process is similar. It can be divided into four interrelated components:

  • Data storage utilizes observation, memory, and recall to provide a factual basis for further reasoning.
  • Abstraction involves the translation of stored data into broader representations and concepts.
  • Generalization uses abstracted data to create knowledge and inferences that drive action in new contexts.
  • Evaluation provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.

Machine learning algorithms are divided into categories according to their purpose.

Main categories are

  • Supervised learning (predictive model, “labeled” data)
    • classification (Logistic Regression, Decision Tree, KNN, Random Forest, SVM, Naive Bayes, etc)
    • numeric prediction (Linear Regression, KNN, Gradient Boosting & AdaBoost, etc)
  • Unsupervised learning (descriptive model, “unlabeled” data)
    • clustering (K-Means)
    • pattern discovery
  • Semi-supervised learning (mixture of “labeled” and “unlabeled” data).
  • Reinforcement learning. Using this approach, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually using trial and error. The machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Reinforcement learning problems are commonly formalized as a Markov Decision Process.

There is a lot of overlap in which ML algorithms can be applied to a particular problem. As a result, for the same problem, there could be many different ML models possible. So, coming up with the best ML model is an art that requires a lot of patience and trial and error. The following figure gives a brief overview of all these learning types with sample use cases.

The supervised learning algorithms are a subset of the family of machine learning algorithms which are mainly used in predictive modeling. A predictive model is basically a model constructed from a machine learning algorithm and features or attributes from training data such that we can predict a value using the other values obtained from the input data. Supervised learning algorithms try to model relationships and dependencies between the target prediction output and the input features such that we can predict the output values for new data based on those relationships which it learned from the previous data sets. The main types of supervised learning algorithms include:

  • Classification algorithms: These algorithms build predictive models from training data which have features and class labels. These predictive models in turn use the features learnt from training data on new, previously unseen data to predict their class labels. The output classes are discrete. Types of classification algorithms include decision trees, random forests, support vector machines, and many more.
  • Regression algorithms: These algorithms are used to predict output values based on some input features obtained from the data. To do this, the algorithm builds a model based on features and output values of the training data and this model is used to predict values for new data. The output values in this case are continuous and not discrete. Types of regression algorithms include linear regression, multivariate regression, regression trees, and lasso regression, among many others.
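As a rough illustration of the classification case, the sketch below implements a tiny k-nearest-neighbours classifier in plain Python. KNN is used here only because it is short to write; the two features (number of links and number of exclamation marks per email) and all data values are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority class label among the k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Training data: features plus a discrete class label for each example
train_points = [(0, 0), (1, 0), (0, 1), (5, 4), (6, 5), (4, 6)]
train_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

# Predict the class of a new, previously unseen example
prediction = knn_predict(train_points, train_labels, (5, 5))
print(prediction)
```

The model here is simply the stored training data; other classifiers such as decision trees learn an explicit set of rules instead.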

Some applications of supervised learning are speech recognition, credit scoring, medical imaging, and search engines.

The unsupervised learning algorithms are the family of machine learning algorithms which are mainly used in pattern detection and descriptive modeling. However, there are no output categories or labels here based on which the algorithm can try to model relationships. These algorithms try to use techniques on the input data to mine for rules, detect patterns, and summarize and group the data points which help in deriving meaningful insights and describe the data better to the users. The main types of unsupervised learning algorithms include:

  • Clustering algorithms: The main objective of these algorithms is to cluster or group input data points into different classes or categories using just the features derived from the input data alone and no other external information. Unlike classification, the output labels are not known beforehand in clustering. There are different approaches to build clustering models, such as by using means, medoids, hierarchies, and many more. Some popular clustering algorithms include k-means, k-medoids, and hierarchical clustering.
  • Association rule learning algorithms: These algorithms are used to mine and extract rules and patterns from data sets. These rules explain relationships between different variables and attributes, and also depict frequent item sets and patterns which occur in the data. These rules in turn help discover useful insights for any business or organization from their huge data repositories. Popular algorithms include Apriori and FP Growth.
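To show how clustering groups points without any labels, here is a minimal plain-Python sketch of k-means (Lloyd's algorithm). The points, the choice of two clusters and the starting centroids are assumptions made for this example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and moving centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Unlabeled 2-D points (hypothetical values)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Only the features are used to form the groups; deciding what each group means is left to the analyst.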

Some applications of unsupervised learning are customer segmentation in marketing, social network analysis, image segmentation, climatology, and many more.

Semi-Supervised Learning. In the previous two types, labels are either absent for all the observations in the dataset or present for all of them. Semi-supervised learning falls in between these two. In many practical situations, the cost of labeling is quite high, since it requires skilled human experts. So, when labels are absent for the majority of observations but present for a few, semi-supervised algorithms are the best candidates for model building. These methods exploit the idea that even though the group memberships of the unlabeled data are unknown, this data carries important information about the group parameters.
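One simple way to picture this is self-training, sketched below in plain Python. Self-training is just one semi-supervised technique, picked here for brevity; the class names, the one-dimensional data and the nearest-centroid rule are all assumptions of this example.

```python
def self_train(labeled, unlabeled):
    """Repeatedly pseudo-label the unlabeled value closest to a class centroid."""
    labeled = {label: list(values) for label, values in labeled.items()}
    remaining = list(unlabeled)
    while remaining:
        # Group parameters (class centroids) estimated from everything labeled so far.
        centroids = {label: sum(v) / len(v) for label, v in labeled.items()}
        # Pick the (value, label) pair the current model is most confident about.
        value, label = min(
            ((x, lab) for x in remaining for lab in centroids),
            key=lambda pair: abs(pair[0] - centroids[pair[1]]),
        )
        labeled[label].append(value)    # treat the prediction as a pseudo-label
        remaining.remove(value)
    return labeled

# Only two observations carry labels; the rest are unlabeled.
labeled = {"low": [1.0], "high": [10.0]}
unlabeled = [2.0, 3.0, 4.0, 9.0, 8.0, 7.0]
result = self_train(labeled, unlabeled)
print(result)
```

Each pseudo-labeled value shifts the centroid of its class, which is how the unlabeled data ends up informing the group parameters.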

The reinforcement learning method aims at using observations gathered from the interaction with the environment to take actions that would maximize the reward or minimize the risk. The reinforcement learning algorithm (called the agent) continuously learns from the environment in an iterative fashion. In the process, the agent learns from its experiences of the environment until it explores the full range of possible states.

In order to produce intelligent programs (also called agents), reinforcement learning goes through the following steps:

  1. Input state is observed by the agent.
  2. Decision making function is used to make the agent perform an action.
  3. After the action is performed, the agent receives reward or reinforcement from the environment.
  4. The state-action pair information about the reward is stored.
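The four steps above can be sketched with tabular Q-learning, one common reinforcement learning algorithm (the text does not name a specific one). The five-state corridor environment, the reward of 1 at the goal and the values of `ALPHA`, `GAMMA` and `EPSILON` are all assumptions made for this example.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # action 0 moves left, action 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2

def choose_action(q_row, rng):
    """Decision-making function: epsilon-greedy with random tie-breaking."""
    if rng.random() < EPSILON or q_row[0] == q_row[1]:
        return rng.randrange(len(ACTIONS))
    return 0 if q_row[0] > q_row[1] else 1

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]      # stored state-action values
    for _ in range(episodes):
        state = 0
        for _ in range(200):                        # step cap per episode
            action = choose_action(q[state], rng)   # steps 1 and 2: observe state, act
            next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
            reward = 1.0 if next_state == N_STATES - 1 else 0.0   # step 3: reward
            # Step 4: update the stored value for this state-action pair.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if state == N_STATES - 1:
                break
    return q

q_table = train()
# Greedy action for every non-goal state (1 means "move right").
policy = [row.index(max(row)) for row in q_table[:-1]]
print(policy)
```

After training, the greedy policy read from the table moves the agent toward the goal from every state.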

Some applications of reinforcement learning algorithms are computer-played board games (Chess, Go), robotic hands, and self-driving cars.

Predictive model

A predictive model is used for tasks that involve the prediction of one value using other values in the dataset. The learning algorithm attempts to discover and model the relationship between the target feature (the feature being predicted) and the other features. Despite the common use of the word “prediction” to imply forecasting, predictive models need not necessarily foresee events in the future. For instance, a predictive model could be used to predict past events, such as the date of a baby’s conception using the mother’s present-day hormone levels. Predictive models can also be used in real time to control traffic lights during rush hours.

Because predictive models are given clear instruction on what they need to learn and how they are intended to learn it, the process of training a predictive model is known as supervised learning. The supervision does not refer to human involvement, but rather to the fact that the target values provide a way for the learner to know how well it has learned the desired task. Stated more formally, given a set of data, a supervised learning algorithm attempts to optimize a function (the model) to find the combination of feature values that result in the target output.

So, supervised learning consists of a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.
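The idea of training until a desired accuracy is reached can be sketched with a perceptron, a very simple supervised learner chosen here only for illustration. The two predictors, the data values, the `target_accuracy` parameter and the `max_epochs` limit are hypothetical.

```python
def predict(weights, bias, x):
    """Map the predictors x to an output of 0 or 1."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train_perceptron(samples, labels, target_accuracy=1.0, max_epochs=100):
    """Adjust the function until training accuracy reaches the desired level."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for epoch in range(1, max_epochs + 1):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)     # 0 when the prediction is right
            weights = [w + error * xi for w, xi in zip(weights, x)]
            bias += error
        correct = sum(predict(weights, bias, x) == y for x, y in zip(samples, labels))
        if correct / len(samples) >= target_accuracy:
            break                                      # desired accuracy achieved
    return weights, bias, epoch

# Predictors (independent variables) and the target (dependent variable)
samples = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [0, 0, 0, 1, 1, 1]
weights, bias, epochs_used = train_perceptron(samples, labels)
print(weights, bias, epochs_used)
```

Stopping on training accuracy alone can lead to overfitting, so in practice the model is also checked on data it has not seen.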

The often used supervised machine learning task of predicting which category an example belongs to is known as classification. It is easy to think of potential uses for a classifier. For instance, you could predict whether:

  • An e-mail message is spam
  • A person has cancer
  • A football team will win or lose
  • An applicant will default on a loan

In classification, the target feature to be predicted is a categorical feature known as the class, and is divided into categories called levels. A class can have two or more levels, and the levels may or may not be ordinal. Because classification is so widely used in machine learning, there are many types of classification algorithms, with strengths and weaknesses suited for different types of input data.

Supervised learners can also be used to predict numeric data such as income, laboratory values, test scores, or counts of items. To predict such numeric values, a common form of numeric prediction fits linear regression models to the input data. Although regression models are not the only type of numeric models, they are, by far, the most widely used. Regression methods are widely used for forecasting, as they quantify in exact terms the association between inputs and the target, including both the magnitude and the uncertainty of the relationship.
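As a minimal sketch of numeric prediction, the code below fits a straight line by ordinary least squares in plain Python. The variable names and the data (hours studied versus test score) are made up for illustration, and only the magnitude of the relationship (the slope) is estimated, not its uncertainty.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]        # constructed to lie exactly on a line
slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept    # numeric prediction for a new input
print(slope, intercept, predicted)
```

The slope says how much the predicted score changes for each additional hour, which is the "magnitude" of the association.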

Descriptive model

A descriptive model is used for tasks that would benefit from the insight gained from summarizing data in new and interesting ways. As opposed to predictive models that predict a target of interest, in a descriptive model, no single feature is more important than any other. In fact, because there is no target to learn, the process of training a descriptive model is called unsupervised learning. It can be more difficult to think of applications for descriptive models (after all, what good is a learner that isn't learning anything in particular?), but they are used quite regularly for data mining.

So, in an unsupervised learning algorithm, we do not have any target or outcome variable to predict or estimate. It is used for clustering a population into different groups, which is widely applied to segmenting customers for specific interventions. Examples of unsupervised learning: the Apriori algorithm, k-means.

For example, the descriptive modeling task called pattern discovery is used to identify useful associations within data. Pattern discovery is often used for market basket analysis on retailers’ transactional purchase data. Here, the goal is to identify items that are frequently purchased together, such that the learned information can be used to refine marketing tactics. For instance, if a retailer learns that swimming trunks are commonly purchased at the same time as sunglasses, the retailer might reposition the items more closely in the store or run a promotion to “up-sell” customers on associated items.
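A very small sketch of the idea behind market basket analysis is to count how often each pair of items appears in the same transaction. The fraction of transactions containing a pair is usually called its support. The baskets and the function name `pair_support` are made up for illustration. Full association rule learners such as Apriori go further than this pair count.

```python
from collections import Counter
from itertools import combinations

def pair_support(transactions):
    """Return the fraction of transactions that contain each item pair."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}

# Hypothetical transactional purchase data.
baskets = [
    ["swimming trunks", "sunglasses", "towel"],
    ["swimming trunks", "sunglasses"],
    ["sunglasses", "hat"],
    ["towel", "hat"],
]

support = pair_support(baskets)
most_common_pair = max(support, key=support.get)
print(most_common_pair, support[most_common_pair])
```

In this toy data, swimming trunks and sunglasses appear together in half of the baskets, which is the kind of association a retailer might act on.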

The descriptive modeling task of dividing a dataset into homogeneous groups is called clustering. This is sometimes used for segmentation analysis that identifies groups of individuals with similar behavior or demographic information, so that advertising campaigns could be tailored for particular audiences. Although the machine is capable of identifying the clusters, human intervention is required to interpret them. For example, given five different clusters of shoppers at a grocery store, the marketing team will need to understand the differences among the groups in order to create a promotion that best suits each group.
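The following is a minimal sketch of k-means, one common clustering algorithm, assuming two-dimensional points and starting centroids supplied by the caller. The shopper data (visits per month, average basket size) and the function name `k_means` are hypothetical. Real implementations usually choose starting centroids at random and stop when the assignments no longer change.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical shoppers: (visits per month, average basket size).
shoppers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = k_means(shoppers, [(0, 0), (10, 10)])
print(final_centroids)
print(final_clusters)
```

The algorithm returns the groups, but it does not say what they mean. Deciding that one cluster represents, say, occasional small-basket shoppers is the human interpretation step described above.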

Lastly, a class of machine learning algorithms known as meta-learners is not tied to a specific learning task, but is rather focused on learning how to learn more effectively. A meta-learning algorithm uses the results of earlier learning to inform additional learning. This can be beneficial for very challenging problems or when a predictive algorithm's performance needs to be as accurate as possible.

The following table lists only a fraction of the entire set of machine learning algorithms.

Model                          | Learning task
Supervised Learning Algorithms
Nearest Neighbor               | Classification
Naive Bayes                    | Classification
Decision Trees                 | Classification
Classification Rule Learners   | Classification
Linear Regression              | Numeric prediction
Model Trees                    | Numeric prediction
Regression Trees               | Numeric prediction
Neural Networks                | Dual use
Support Vector Machines        | Dual use
Unsupervised Learning Algorithms
Association Rules              | Pattern detection
k-means clustering             | Clustering
Meta-Learning Algorithms
Bagging                        | Dual use
Boosting                       | Dual use
Random Forests                 | Dual use

To begin applying machine learning to a real-world project, you will need to determine which of the four learning tasks your project represents: classification, numeric prediction, pattern detection, or clustering. The task will drive the choice of algorithm. For instance, if you are undertaking pattern detection, you are likely to employ association rules. Similarly, a clustering problem will likely utilize the k-means algorithm, and numeric prediction will utilize regression analysis or regression trees.

Torsten Hothorn maintains an exhaustive list of packages available in R for implementing machine learning algorithms.

Model evaluation

Whenever we build a model, it needs to be tested and evaluated to ensure that it works not only on the training data but also on unseen data, and that it generates accurate results. A model should not generate random results, though some noise is permitted. If the model is not evaluated properly, the results it produces on unseen data may well be inaccurate. Furthermore, model evaluation can help select the optimum model, one that is more robust and can accurately predict responses for future subjects.

There are various ways by which a model can be evaluated:

  • Split test. In a split test, the dataset is divided into two parts: a training set and a test set. The algorithm uses the training set to create a model, and the accuracy of the model is then measured on the test set. The ratio between training and test data can be decided on the basis of the size of the dataset. A split test is fast, which makes it a good fit when the dataset is large or when training is expensive. Its result depends on how the dataset happens to be divided: an 80% training and 20% test split will generally give a different result than a 60% and 40% split. We can run multiple split tests, dividing the dataset in different ratios, and compare the resulting accuracies.
  • Cross validation. In cross validation, the dataset is divided into a number of parts, for example, 10. The algorithm is trained on 9 of the parts and one is held back for testing. This process is repeated 10 times so that each part is held back once, and the accuracy is computed from the results of all runs. This is known as k-fold cross validation, where k is the number of parts into which the dataset is divided. Selecting k is crucial and depends on the size of the dataset.
  • Bootstrap. We draw a random sample from the dataset, typically with replacement, and run the algorithm on that sample. This process is repeated n times. In aggregate, the results of all repetitions show the model performance.
  • Leave One Out Cross Validation. As the name suggests, only one data point is left out of the dataset, the algorithm is run on the rest of the dataset, and this is repeated for each point. Because every point in the dataset is covered, the estimate is less biased, but it requires a longer execution time if the dataset is large.
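The four evaluation schemes above differ only in how they choose training and test rows, so they can be sketched as functions that return row indices. The function names, the fixed random seeds, and the choice to test each bootstrap sample on the rows that were not drawn are assumptions of this sketch, not something the text prescribes.

```python
import random

def split_test(n, train_fraction=0.8, seed=0):
    """Shuffle the row indices once and cut them into train and test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(n * train_fraction)
    return indices[:cut], indices[cut:]

def k_fold(n, k):
    """Yield k (train, test) pairs; each row is held out exactly once."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        train = [i for i in indices if i not in held_out]
        yield train, held_out

def leave_one_out(n):
    """Leave-one-out is k-fold cross validation with k equal to n."""
    return k_fold(n, n)

def bootstrap(n, repetitions, seed=0):
    """Yield samples drawn with replacement, tested on the undrawn rows."""
    rng = random.Random(seed)
    for _ in range(repetitions):
        train = [rng.randrange(n) for _ in range(n)]
        drawn = set(train)
        test = [i for i in range(n) if i not in drawn]
        yield train, test

train_rows, test_rows = split_test(10)
five_folds = list(k_fold(10, 5))
loo_splits = list(leave_one_out(4))
bootstrap_samples = list(bootstrap(10, 3))
print(len(train_rows), len(test_rows), len(five_folds), len(loo_splits))
```

In each case the model would be fitted on the training indices and scored on the test indices, and the scores from all repetitions would then be averaged.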

Model evaluation is a key step in any machine learning process. It is different for supervised and unsupervised models. In supervised models, predictions play a major role; whereas in unsupervised models, homogeneity within clusters and heterogeneity across clusters play a major role.

Some widely used model evaluation parameters for regression models (including cross validation) are as follows:

  • Coefficient of determination
  • Root mean squared error
  • Mean absolute error
  • Akaike or Bayesian information criterion
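The first three regression measures can be computed directly from the actual and predicted values. The sketch below uses their standard definitions. The function name `regression_metrics` and the sample numbers are illustrative only. The information criteria are omitted because they also depend on the model's likelihood and number of parameters.

```python
import math

def regression_metrics(actual, predicted):
    """Return R squared, RMSE, and MAE for paired numeric values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_residual = sum(e * e for e in errors)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "r_squared": 1 - ss_residual / ss_total,
        "rmse": math.sqrt(ss_residual / n),
        "mae": sum(abs(e) for e in errors) / n,
    }

# Hypothetical actual values and model predictions.
actual_values = [3, 5, 7, 9]
predicted_values = [2, 5, 8, 9]

metrics = regression_metrics(actual_values, predicted_values)
print(metrics)
```

A coefficient of determination close to 1 and small RMSE and MAE values indicate predictions that track the actual values closely.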

Some widely used model evaluation parameters for classification models (including cross validation) are as follows:

  • Confusion matrix (accuracy, precision, recall, and F1-score)
  • Gain or lift charts
  • Area under ROC (receiver operating characteristic) curve
  • Concordant and discordant ratio
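The confusion matrix measures for a two-class problem can be derived from four counts: true positives, false positives, false negatives, and true negatives. The following sketch uses the standard definitions. The function name and the spam/ham labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def binary_classification_metrics(actual, predicted, positive):
    """Compute confusion matrix counts and the measures derived from them."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels for eight e-mails.
actual_labels = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted_labels = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

report = binary_classification_metrics(actual_labels, predicted_labels, "spam")
print(report)
```

Precision asks how many predicted positives were correct, while recall asks how many actual positives were found. The F1-score is their harmonic mean.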

Some of the widely used evaluation parameters of unsupervised models (clustering) are as follows:

  • Contingency tables
  • Sum of squared errors between clustering objects and cluster centers or centroids
  • Silhouette value
  • Rand index
  • Matching index
  • Pairwise and adjusted pairwise precision and recall (primarily used in NLP)
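Of these measures, the sum of squared errors is the simplest to compute. The sketch below adds up the squared distance of every point to the centroid of its own cluster. The function name `within_cluster_sse` and the two candidate groupings are invented for illustration.

```python
def within_cluster_sse(clusters):
    """Sum the squared distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        centroid = [sum(axis) / len(cluster) for axis in zip(*cluster)]
        for point in cluster:
            total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total

# Two hypothetical ways of grouping the same four points.
tight_grouping = [[(0, 0), (2, 0)], [(10, 10), (10, 12)]]
loose_grouping = [[(0, 0), (10, 10)], [(2, 0), (10, 12)]]

tight_sse = within_cluster_sse(tight_grouping)
loose_sse = within_cluster_sse(loose_grouping)
print(tight_sse, loose_sse)
```

A lower value means the points sit closer to their centroids, that is, the clusters are more homogeneous. The value always shrinks as more clusters are added, so it is best used to compare groupings with the same number of clusters.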

Bias and variance are two key error components of any supervised model; their trade-off plays a vital role in model tuning and selection. Bias is due to incorrect assumptions made by a predictive model while learning outcomes, whereas variance is due to the model's sensitivity to the particular training dataset it was given. In other words, higher bias leads to underfitting and higher variance leads to overfitting of models.

In bias, the assumptions are on target functional forms. Hence, this is dominant in parametric models such as linear regression, logistic regression, and linear discriminant analysis as their outcomes are a functional form of input variables.

Variance, on the other hand, shows how susceptible models are to changes in the dataset. Generally, target functional forms control variance. Hence, this is dominant in non-parametric models such as decision trees, support vector machines, and K-nearest neighbors, as their outcomes are not directly a functional form of the input variables. In other words, the hyperparameters of non-parametric models can lead to overfitting of predictive models.
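One rough way to see the difference is to train two models on two slightly different samples of the same relationship and compare their predictions at the same point. In this sketch, a model that always predicts the mean stands in for a high-bias learner and a one-nearest-neighbor model stands in for a high-variance learner. The data, the underlying relationship y = 2x, and the function names are assumptions made for this example.

```python
def mean_model(xs, ys):
    """High-bias learner: ignore x and always predict the mean of y."""
    average = sum(ys) / len(ys)
    return lambda query: average

def nearest_neighbor_model(xs, ys):
    """High-variance learner: copy the y of the closest training x."""
    def predict(query):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - query))
        return ys[nearest]
    return predict

# Two hypothetical noisy samples of y = 2x at the same x values.
xs = [1, 2, 3, 4]
sample_a = [2.5, 3.5, 6.5, 7.5]
sample_b = [1.5, 4.5, 5.5, 8.5]
query = 4  # the noise-free value at x = 4 is 8

mean_predictions = [mean_model(xs, ys)(query) for ys in (sample_a, sample_b)]
nn_predictions = [nearest_neighbor_model(xs, ys)(query)
                  for ys in (sample_a, sample_b)]
print(mean_predictions)  # [5.0, 5.0]
print(nn_predictions)    # [7.5, 8.5]
```

The mean model gives the same answer for both samples but is consistently far from 8, which is the signature of bias. The nearest-neighbor model is close to 8 on average but its answer moves with the noise in each sample, which is the signature of variance.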