Machine learning algorithms

What are machine learning algorithms?

Machine learning algorithms are pieces of code that help people explore, analyze, and find meaning in complex data sets. Each algorithm is a finite set of unambiguous step-by-step instructions that a machine can follow to achieve a certain goal. In a machine learning model, the goal is to establish or discover patterns that people can use to make predictions or categorize information. What is machine learning?

Machine learning algorithms use parameters that are based on training data—a subset of data that represents the larger set. As the training data expands to represent the world more realistically, the algorithm calculates more accurate results.

Different algorithms analyze data in different ways. They’re often grouped by the machine learning techniques that they’re used for: supervised learning, unsupervised learning, and reinforcement learning. The most commonly used algorithms use regression and classification to predict target categories, find unusual data points, predict values, and discover similarities.

Machine learning techniques

As you learn more about machine learning algorithms, you’ll find that they typically fall within one of three machine learning techniques:

Supervised learning

In supervised learning, algorithms make predictions based on a set of labeled examples that you provide. This technique is useful when you know what the outcome should look like.

For example, you provide a dataset that includes city populations by year for the past 100 years, and you want to know what the population of a specific city will be four years from now. The outcome uses labels that already exist in the data set: population, city, and year.

Unsupervised learning

In unsupervised learning, the data points aren’t labeled—the algorithm labels them for you by organizing the data or describing its structure. This technique is useful when you don’t know what the outcome should look like.

For example, you provide customer data, and you want to create segments of customers who like similar products. The data that you’re providing isn’t labeled, and the labels in the outcome are generated based on the similarities that were discovered between data points.

Reinforcement learning

Reinforcement learning uses algorithms that learn from outcomes and decide which action to take next. After each action, the algorithm receives feedback that helps it determine whether the choice it made was correct, neutral, or incorrect. It’s a good technique to use for automated systems that have to make a lot of small decisions without human guidance.

For example, you’re designing an autonomous car, and you want to ensure that it’s obeying the law and keeping people safe. As the car gains experience and a history of reinforcement, it learns how to stay in its lane, go the speed limit, and brake for pedestrians.

What you can do with machine learning algorithms

Machine learning algorithms help you answer questions that are too complex to answer through manual analysis. There are many different machine learning algorithm types, but use cases for machine learning algorithms typically fall into one of these categories.

Predict a target category

Two-class (binary) classification algorithms divide the data into two categories. They’re useful for questions that have only two possible answers that are mutually exclusive, including yes/no questions. For example:

  • Will this tire fail in the next 1,000 miles: yes or no?
  • Which brings in more referrals: a $10 credit or a 15% discount?

Multiclass (multinomial) classification algorithms divide the data into three or more categories. They’re useful for questions that have three or more possible answers that are mutually exclusive. For example:

  • In which month do the majority of travelers purchase airline tickets?
  • What emotion is the person in this photo displaying?

Find unusual data points

Anomaly detection algorithms identify data points that fall outside of the defined parameters for what’s “normal.” For example, you would use anomaly detection algorithms to answer questions like:

  • Where are the defective parts in this batch?
  • Which credit card purchases might be fraudulent?

Predict values

Regression algorithms predict the value of a new data point based on historical data. They help you answer questions like:

  • How much will the average two-bedroom home cost in my city next year?
  • How many patients will come through the clinic on Tuesday?

See how values change over time

Time series algorithms show how a given value changes over time. With time series analysis and time series forecasting, data is collected at regular intervals over time and used to make predictions and identify trends, seasonality, cyclicity, and irregularity. Time series algorithms are used to answer questions like:

  • Is the price of a given stock likely to rise or fall in the coming year?
  • What will my expenses be next year?

Discover similarities

Clustering algorithms divide the data into multiple groups by determining the level of similarity between data points. Clustering algorithms work well for questions like:

  • Which viewers like the same types of movies?
  • Which printer models fail in the same way?


Classification algorithms use predictive calculations to assign data to preset categories. Classification algorithms are trained on input data, and used to answer questions like:

  • Is this email spam?
  • What is the sentiment (positive, negative, or neutral) of a given text?

Linear regression algorithms show or predict the relationship between two variable or factors by fitting a continuous straight line to the data. The line is often calculated using the Squared Error Cost function. Linear regression is one of the most popular types of regression analysis.

Logistic regression algorithms fit a continuous S-shaped curve to the data. Logistic regression is another popular type of regression analysis.

Naïve Bayes algorithms calculate the probability that an event will occur, based on the occurrence of a related event.

Support Vector Machines draw a hyperplane between the two closest data points. This marginalizes the classes and maximizes the distances between them to more clearly differentiate them.

Decision tree algorithms split the data into two or more homogeneous sets. They use if–then rules to separate the data based on the most significant differentiator between data points.

K-Nearest neighbor algorithms store all available data points and classify each new data point based on the data points that are closest to it, as measured by a distance function.

Random forest algorithms are based on decision trees, but instead of creating one tree, they create a forest of trees and then randomize the trees in that forest. Then, they aggregate votes from different random formations of the decision trees to determine the final class of the test object.

Gradient boosting algorithms produce a prediction model that bundles weak prediction models—typically decision trees—through an ensembling process that improves the overall performance of the model.

K-Means algorithms classify data into clusters—where K equals the number of clusters. The data points inside of each cluster are homogeneous, and they’re heterogeneous to data points in other clusters.

What are machine learning libraries?

A machine learning library is a set of functions, frameworks, modules, and routines written in a given language. Developers use the code in machine learning libraries as building blocks for creating machine learning solutions that can perform complex tasks. Instead of having to manually code every algorithm and formula in a machine learning solution, developers can find the functions and modules they need in one of many available ML libraries, and use those to build a solution that meets their needs.

Top 10 Machine Learning Algorithms

A machine learning algorithm is a program code (math or program logic) that enables professionals to study, analyze, comprehend and explore large complex datasets. This article explains the fundamentals of machine learning algorithms and reveals the top 10 machine learning algorithms in 2022.

Table of Contents

  • What Is a Machine Learning Algorithm?
  • Top 10 Machine Learning Algorithms in 2022

What Is a Machine Learning Algorithm?

A machine learning algorithm refers to a program code (math or program logic) that enables professionals to study, analyze, comprehend, and explore large complex datasets. Each algorithm follows a series of instructions to accomplish the objective of making predictions or categorizing information by learning, establishing, and discovering patterns embedded in the data.

Types of Machine Learning
Types of Machine Learning

Machine learning algorithms specify rules and processes that a system should consider while addressing a specific problem. These algorithms analyze and simulate data to predict the result within a predetermined range. Moreover, as new data is fed into these algorithms, they learn, optimize, and improve based on the feedback on previous performance in predicting outcomes. In simple words, machine learning algorithms tend to become ‘smarter’ with each iteration.

Depending on the type of algorithm, machine learning models use several parameters such as gamma parameter, max_depth, n_neighbors, and others to analyze data and produce accurate results. These parameters are a consequence of training data that represents a larger dataset.

Machine learning algorithms are classified into four types based on the learning techniques: supervised, semi-supervised, unsupervised, and reinforcement learning. Regression and classification algorithms are the most popular options for predicting values, identifying similarities, and discovering unusual data patterns.

1. Supervised learning

Supervised learning algorithms use labeled datasets to make predictions. This learning technique is beneficial when you know the kind of result or outcome you intend to have.

For example, consider that you have a dataset that specifies the rain that occurred in a geographic area during a particular season over the past 200 years. You intend to know the expected rain during that specific season for the next ten years. Here, the outcome is derived based on the labels existing in the original dataset, i.e., rainfall, geographic area, season, and year.

2. Unsupervised learning

Unsupervised learning algorithms use unlabeled data. This learning technique labels the unlabeled data by categorizing the data or expressing its type, form, or structure. This technique comes in handy when the result type is unknown.

For example, when you use a dataset of Facebook users, you intend to classify users who show inclination (based on likes) toward similar Facebook ad campaigns. In this case, the dataset is unlabeled. However, the result will have labels as the algorithm will find similarities between data points while classifying the users.

3. Semi-supervised learning (SSL)

Semi-supervised learning algorithms combine the above two, where labeled and unlabeled data are used. The objective of these algorithms is to categorize unlabeled data based on the information derived from labeled data.

Consider the example of web content classification. Categorizing and classifying the content available on the internet is a time- and resource-intensive task. Apart from AI algorithms, it requires human resources to organize billions of web pages available online. In such cases, SSL models can play a crucial role in accomplishing the task efficiently.

4. Reinforcement learning

Reinforcement learning algorithms use the result or outcome as a benchmark to decide the next action step. In other words, these algorithms learn from previous outcomes, receive feedback after every step, and then decide whether to go ahead with the next step or not. The system learns whether it made a right, wrong, or neutral choice in the process. Automated systems can employ reinforcement learning as they are designed to make decisions with minimal human intervention.

For example, you design a self-driving car and intend to track whether the car is following traffic rules and ensuring safety on the roads. By applying reinforcement learning, the vehicle learns through experience and reinforcement tactics. The algorithm ensures that the car obeys traffic laws of staying in one lane, follows speed limits, and stops encountering pedestrians or animals on the road.

See More: What Is Artificial Intelligence (AI) as a Service? Definition, Architecture, and Trends

Top 10 Machine Learning Algorithms in 2022

Machine learning has significantly impacted our daily lives. Machine learning is omnipresent from smart assistants scheduling appointments, playing songs, and notifying users based on calendar events to NLP-based voice assistants. All such intelligent systems operate on machine learning algorithms.

In data science, each machine learning algorithm handles a specific problem. In some cases, professionals tend to opt for a combination of these algorithms as one algorithm may not be able to solve a particular problem.

Here, we look at the top 10 machine learning algorithms that are frequently used to achieve actual results.

1. Linear regression

Linear regression gives a relationship between input (x) and an output variable (y), also referred to as independent and dependent variables. Let’s understand the algorithm with an example where you are required to arrange a few plastic boxes of different sizes on separate shelves based on their corresponding weights.

The task is to be completed without manually weighing the boxes. Instead, you need to guess the weight just by observing the boxes’ height, dimensions, and sizes. In short, the entire task is driven based on visual analysis. Thus, you have to use a combination of visible variables to make the final arrangement on the shelves.

Linear regression in machine learning is of a similar kind, where the relationship between independent and dependent variables is established by fitting them to a regression line. This line has a mathematical representation given by the linear equation y = mx + c, where y represents the dependent variable, m = slope, x = independent variable, and b = intercept.

The objective of linear regression is to find the best fit line that reveals the relationship between variables y and x.

2. Logistic regression

The dependent variable is of binary type (dichotomous) in logistic regression. This type of regression analysis describes data and explains the relationship between one dichotomous variable and one or more independent variables.

Logistic regression is used in predictive analysis where pertinent data predict an event probability to a logit function. Thus, it is also called logit regression.

Mathematically, logistic regression is represented by the equation:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))


x = input value, y = predicted output, b0 = bias or intercept term, b1 = coefficient for input (x).

Logistic regression could be used to predict whether a particular team will win (1) the FIFA World Cup 2022 or not (0), or whether a lockdown will be imposed (1) due to rising COVID-19 cases or not (0). Thus, the binary outcomes of logistic regression facilitate faster decision-making as you only need to pick one out of the two alternatives.

3. Decision trees

With a decision tree, you can visualize the map of potential results for a series of decisions. It enables companies to compare possible outcomes and then take a straightforward decision based on parameters such as advantages and probabilities that are beneficial to them.

Decision tree algorithms can potentially anticipate the best option based on a mathematical construct and also come in handy while brainstorming over a specific decision. The tree starts with a root node (decision node) and then branches into sub-nodes representing potential outcomes.

Each outcome can further create child nodes that can open up other possibilities. The algorithm generates a tree-like structure that is used for classification problems. For example, consider the decision tree below that helps finalize a weekend plan based on the weather forecast.

Decision Tree
Decision Tree

4. Support vector machines (SVMs)

Support vector machine algorithms are used to accomplish both classification and regression tasks. These are supervised machine learning algorithms that plot each piece of data in the n-dimensional space, with n referring to the number of features. Each feature value is associated with a coordinate value, making it easier to plot the features.

Moreover, classification is further performed by distinctly determining the hyper-plane that separates the two sets of support vectors or classes. A good separation ensures a good classification between the plotted data points.

Support Vector Machines
Support Vector Machines

In simple words, SVMs represent the coordinates for individual observations. These are popular machine learning classifiers used in applications such as data classification, facial expression classification, text classification, steganography detection in digital images, speech recognition, and others.

5. Naive Bayes algorithm

Naive Bayes refers to a probabilistic machine learning algorithm based on the Bayesian probability model and is used to address classification problems. The fundamental assumption of the algorithm is that features under consideration are independent of each other and a change in the value of one does not impact the value of the other.

For example, you can consider a ball, a cricket ball, if it is red, round, has a 7.1-7.26 cm diameter, and has a mass of 156-163 g. Although all these features could be interdependent, each one contributes to the probability that it is a cricket ball. This is the reason the algorithm is referred to as ‘naïve’.

Let’s look at the mathematical representation of the algorithm.

If X, Y = probabilistic events, P (X) = probability of X being true, P(X|Y) = conditional probability of X being true in case Y is true.

Then, Bayes’ theorem is given by the equation:

P (X|Y) = (P (Y|X) x P (X)) /P (Y)

A naive Bayesian approach is easy to develop and implement. It is capable of handling massive datasets and is useful for making real-time predictions. Its applications include spam filtering, sentiment analysis and prediction, document classification, and others.

6. KNN classification algorithm

The K Nearest Neighbors (KNN) algorithm is used for both classification and regression problems. It stores all the known use cases and classifies new use cases (or data points) by segregating them into different classes. This classification is accomplished based on the similarity score of the recent use cases to the available ones.

KNN is a supervised machine learning algorithm, wherein ‘K’ refers to the number of neighboring points we consider while classifying and segregating the known n groups. The algorithm learns at each step and iteration, thereby eliminating the need for any specific learning phase. The classification is based on the neighbor’s majority vote.

The algorithm uses these steps to perform the classification:

  • For a training dataset, calculate the distance between the data points that are to be classified and the rest of the data points.
  • Choose the closest ‘K’ elements based on the distance or function used.
  • Consider a ‘majority vote’ between the K points–the class or label dominating all data points reveals the final ranking.

Real-life applications of KNN algorithms include facial recognition, text mining, and recommendation systems such as Amazon, Netflix, and others.

7. K-Means

K-Means is a distance-based unsupervised machine learning algorithm that accomplishes clustering tasks. In this algorithm, you classify datasets into clusters (K clusters) where the data points within one set remain homogenous, and the data points from two different clusters remain heterogeneous.

The clusters under K-Means are formed using these steps:

  • Initialization: The K-means algorithm selects centroids for each cluster (‘K’ number of points).
  • Assign objects to centroid: Clusters are formed with the closest centroids (K clusters) at each data point.
  • Centroid update: Create new centroids based on existing clusters and determine the closest distance for each data point based on new centroids. Here, the position of the centroid also gets updated whenever required.
  • Repeat: Repeat the process till the centroids do not change.
K- Means

K-Means clustering is useful in applications such as clustering Facebook users with common likes and dislikes, document clustering, segmenting customers who buy similar ecommerce products, etc.

8. Random forest algorithm

Random forest algorithms use multiple decision trees to handle classification and regression problems. It is a supervised machine learning algorithm where different decision trees are built on different samples during training. These algorithms help estimate missing data and tend to keep the accuracy intact in situations when a large chunk of data is missing in the dataset.

Random forest algorithms follow these steps:

  • Select random data samples from a given data set.
  • Build a decision tree for each data sample and provide the prediction result for each decision tree.
  • Carry out voting for each expected result.
  • Select the final prediction result based on the highest voted prediction result.
Random Forest Algorithm
Random Forest Algorithm

This algorithm finds applications in finance, ecommerce (recommendation engines), computational biology (gene classification, biomarker discovery), and others.

9. Artificial neural networks (ANNs)

Artificial neural networks are machine learning algorithms that mimic the human brain (neuronal behavior and connections) to solve complex problems. ANN has three or more interconnected layers in its computational model that process the input data.

The first layer is the input layer or neurons that send input data to deeper layers. The second layer is called the hidden layer. The components of this layer change or tweak the information received through various previous layers by performing a series of data transformations. These are also called neural layers. The third layer is the output layer that sends the final output data for the problem.

ANN algorithms find applications in smart home and home automation devices such as door locks, thermostats, smart speakers, lights, and appliances. They are also used in the field of computational vision, specifically in detection systems and autonomous vehicles.

10. Recurrent neural networks (RNNs)

Recurrent neural networks refer to a specific type of ANN that processes sequential data. Here, the result of the previous step acts as the input to the current step. This is facilitated via the hidden state that remembers information about a sequence. It acts as a memory that maintains the information on what was previously calculated. The memory of RNN reduces the overall complexity of the neural network.

Recurrent Neural Networks
Recurrent Neural Network

RNN analyzes time series data and possesses the ability to store, learn, and maintain contexts of any length. RNN is used in cases where time sequence is of paramount importance, such as speech recognition, language translation, video frame processing, text generation, and image captioning. Even Siri, Google Assistant, and Google Translate use the RNN architecture.

See More: What Is Logistic Regression? Equation, Assumptions, Types, and Best Practices


Machine learning algorithms tend to learn from observations. They analyze data, map input to output, and detect data patterns. The algorithms become smarter as they process more data, improving overall predictive performance.

Depending on the changing requirements and the complexity of the problems, new variants of existing machine learning algorithms continue to emerge. You can choose the algorithm that best suits your needs and get a head start on machine learning.

Top 10 Machine Learning Algorithms For Beginners: Supervised, and More

Table of Contents

What‌ ‌Are‌ ‌The‌ ‌10 ‌Popular‌ ‌Machine‌ ‌Learning Algorithms?‌

How Learning These Vital Algorithms Can Enhance Your Skills in Machine Learning

List of Popular Machine Learning Algorithms


In a world where nearly all manual tasks are being automated, the definition of manual is changing. There are now many different types of Machine Learning algorithms, some of which can help computers play chess, perform surgeries, and get smarter and more personal.

We are living in an era of constant technological progress, and looking at how computing has advanced over the years, we can predict what’s to come in the days ahead. 

One of the main features of this revolution that stands out is how computing tools and techniques have been democratized. Data scientists have built sophisticated data-crunching machines in the last 5 years by seamlessly executing advanced techniques. The results have been astounding. 

The many different types of machine learning algorithms have been designed in such dynamic times to help solve real-world complex problems. The ml algorithms are automated and self-modifying to continue improving over time. Before we delve into the top 10 machine learning algorithms you should know, let’s take a look at the different types of machine learning algorithms and how they are classified. 

Machine learning algorithms are classified into 4 types:

  • Supervised
  • Unsupervised Learning
  • Semi-supervised Learning
  • Reinforcement Learning

Read More: Supervised and Unsupervised Learning in Machine Learning

However, these four types of ml algorithms are further classified into more types. 

Below is the list of Top 10 commonly used Machine Learning (ML) Algorithms:

  • Linear regression
  • Logistic regression
  • Decision tree
  • SVM algorithm
  • Naive Bayes algorithm
  • KNN algorithm
  • K-means
  • Random forest algorithm
  • Dimensionality reduction algorithms
  • Gradient boosting algorithm and AdaBoosting algorithm

Read More: How to Become a Machine Learning Engineer?

How Learning These Vital Algorithms Can Enhance Your Skills in Machine Learning

If you’re a data scientist or a machine learning enthusiast, you can use these techniques to create functional Machine Learning projects.

There are three types of most popular Machine Learning algorithms, i.e – supervised learning, unsupervised learning, and reinforcement learning. All three techniques are used in this list of 10 common Machine Learning Algorithms: 

Machine Learning Algorithms

Also Read: Training for a Career in AI & Machine Learning

1. Linear Regression

To understand the working functionality of Linear Regression, imagine how you would arrange random logs of wood in increasing order of their weight. There is a catch; however – you cannot weigh each log. You have to guess its weight just by looking at the height and girth of the log (visual analysis) and arranging them using a combination of these visible parameters. This is what linear regression in machine learning is like.

In this process, a relationship is established between independent and dependent variables by fitting them to a line. This line is known as the regression line and is represented by a linear equation Y= a *X + b.

In this equation:

  • Y – Dependent Variable
  • a – Slope
  • X – Independent variable
  • b – Intercept

The coefficients a & b are derived by minimizing the sum of the squared difference of distance between data points and the regression line.

2. Logistic Regression

Logistic Regression is used to estimate discrete values (usually binary values like 0/1) from a set of independent variables. It helps predict the probability of an event by fitting data to a logit function. It is also called logit regression.

These methods listed below are often used to help improve logistic regression models:

  • include interaction terms
  • eliminate features
  • regularize techniques
  • use a non-linear model

3. Decision Tree

Decision Tree algorithm in machine learning is one of the most popular algorithm in use today; this is a supervised learning algorithm that is used for classifying problems. It works well in classifying both categorical and continuous dependent variables. This algorithm divides the population into two or more homogeneous sets based on the most significant attributes/ independent variables.

4. SVM (Support Vector Machine) Algorithm

SVM algorithm is a method of a classification algorithm in which you plot raw data as points in an n-dimensional space (where n is the number of features you have). The value of each feature is then tied to a particular coordinate, making it easy to classify the data. Lines called classifiers can be used to split the data and plot them on a graph.

5. Naive Bayes Algorithm

A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Even if these features are related to each other, a Naive Bayes classifier would consider all of these properties independently when calculating the probability of a particular outcome.

A Naive Bayesian model is easy to build and useful for massive datasets. It’s simple and is known to outperform even highly sophisticated classification methods.

6. KNN (K- Nearest Neighbors) Algorithm

This algorithm can be applied to both classification and regression problems. Apparently, within the Data Science industry, it’s more widely used to solve classification problems. It’s a simple algorithm that stores all available cases and classifies any new cases by taking a majority vote of its k neighbors. The case is then assigned to the class with which it has the most in common. A distance function performs this measurement.

KNN can be easily understood by comparing it to real life. For example, if you want information about a person, it makes sense to talk to his or her friends and colleagues!

Things to consider before selecting K Nearest Neighbours Algorithm: 

  • KNN is computationally expensive
  • Variables should be normalized, or else higher range variables can bias the algorithm
  • Data still needs to be pre-processed.

7. K-Means

It is an unsupervised learning algorithm that solves clustering problems. Data sets are classified into a particular number of clusters (let’s call that number K) in such a way that all the data points within a cluster are homogenous and heterogeneous from the data in other clusters.

How K-means forms clusters:

  • The K-means algorithm picks k number of points, called centroids, for each cluster.
  • Each data point forms a cluster with the closest centroids, i.e., K clusters.
  • It now creates new centroids based on the existing cluster members.
  • With these new centroids, the closest distance for each data point is determined. This process is repeated until the centroids do not change.

8. Random Forest Algorithm

A collective of decision trees is called a Random Forest. To classify a new object based on its attributes, each tree is classified, and the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is planted & grown as follows:

  • If the number of cases in the training set is N, then a sample of N cases is taken at random. This sample will be the training set for growing the tree.
  • If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M, and the best split on this m is used to split the node. The value of m is held constant during this process.
  • Each tree is grown to the most substantial extent possible. There is no pruning. 

9. Dimensionality Reduction Algorithms

In today’s world, vast amounts of data are being stored and analyzed by corporates, government agencies, and research organizations. As a data scientist, you know that this raw data contains a lot of information – the challenge is to identify significant patterns and variables.

Dimensionality reduction algorithms like Decision Tree, Factor Analysis, Missing Value Ratio, and Random Forest can help you find relevant details.

10. Gradient Boosting Algorithm and AdaBoosting Algorithm

Gradient Boosting Algorithm and AdaBoosting Algorithm are boosting algorithms used when massive loads of data have to be handled to make predictions with high accuracy. Boosting is an ensemble learning algorithm that combines the predictive power of several base estimators to improve robustness.

In short, it combines multiple weak or average predictors to build a strong predictor. These boosting algorithms always work well in data science competitions like Kaggle, AV Hackathon, CrowdAnalytix. These are the most preferred machine learning algorithms today. Use them, along with Python and R Codes, to achieve accurate outcomes.


If you want to build a career in machine learning, start right away. The field is increasing, and the sooner you understand the scope of machine learning tools, the sooner you’ll be able to provide solutions to complex work problems. However, if you are experienced in the field and want to boost your career, you can take-up the Post Graduate Program in AI and Machine Learning in partnership with Purdue University collaborated with IBM. This program gives you an in-depth knowledge of Python, Deep Learning algorithm with the Tensor flow, Natural Language Processing, Speech Recognition, Computer Vision, and Reinforcement Learning.

Also, prepare yourself for Machine Learning interview questions to land at your dream job!

The 10 Best Machine Learning Algorithms for Data Science Beginners

Interest in learning machine learning has skyrocketed in the years since Harvard Business Review article named ‘Data Scientist’ the ‘Sexiest job of the 21st century’. But if you’re just starting out in machine learning, it can be a bit difficult to break into. That’s why we’re rebooting our immensely popular post about good machine learning algorithms for beginners.

(This post was originally published on KDNuggets as The 10 Algorithms Machine Learning Engineers Need to Know. It has been reposted with perlesson, and was last updated in 2019).

This post is targeted towards beginners. If you’ve got some experience in data science and machine learning, you may be more interested in this more in-depth tutorial on doing machine learning in Python with scikit-learn, or in our machine learning courses, which start here. If you’re not clear yet on the differences between “data science” and “machine learning,” this article offers a good explanation: machine learning and data science — what makes them different?

Machine learning algorithms are programs that can learn from data and improve from experience, without human intervention. Learning tasks may include learning the function that maps the input to the output, learning the hidden structure in unlabeled data; or ‘instance-based learning’, where a class label is produced for a new instance by comparing the new instance (row) to instances from the training data, which were stored in memory. ‘Instance-based learning’ does not create an abstraction from specific instances.

Types of Machine Learning Algorithms

There are 3 types of machine learning (ML) algorithms:

Supervised Learning Algorithms:

Supervised learning uses labeled training data to learn the mapping function that turns input variables (X) into the output variable (Y). In other words, it solves for f in the following equation:Y = f (X)

This allows us to accurately generate outputs when given new inputs.

We’ll talk about two types of supervised learning: classification and regression.

Classification is used to predict the outcome of a given sample when the output variable is in the form of categories. A classification model might look at the input data and try to predict labels like “sick” or “healthy.”

Regression is used to predict the outcome of a given sample when the output variable is in the form of real values. For example, a regression model might process input data to predict the amount of rainfall, the height of a person, etc.

The first 5 algorithms that we cover in this blog – Linear Regression, Logistic Regression, CART, Naïve-Bayes, and K-Nearest Neighbors (KNN) — are examples of supervised learning.

Ensembling is another type of supervised learning. It means combining the predictions of multiple machine learning models that are individually weak to produce a more accurate prediction on a new sample. Algorithms 9 and 10 of this article — Bagging with Random Forests, Boosting with XGBoost — are examples of ensemble techniques.

Unsupervised Learning Algorithms:

Unsupervised learning models are used when we only have the input variables (X) and no corresponding output variables. They use unlabeled training data to model the underlying structure of the data.

We’ll talk about three types of unsupervised learning:

Association is used to discover the probability of the co-occurrence of items in a collection. It is extensively used in market-basket analysis. For example, an association model might be used to discover that if a customer purchases bread, s/he is 80% likely to also purchase eggs.

Clustering is used to group samples such that objects within the same cluster are more similar to each other than to the objects from another cluster.

Dimensionality Reduction is used to reduce the number of variables of a data set while ensuring that important information is still conveyed. Dimensionality Reduction can be done using Feature Extraction methods and Feature Selection methods. Feature Selection selects a subset of the original variables. Feature Extraction performs data transformation from a high-dimensional space to a low-dimensional space. Example: PCA algorithm is a Feature Extraction approach.

Algorithms 6-8 that we cover here — Apriori, K-means, PCA — are examples of unsupervised learning.

Reinforcement learning:

Reinforcement learning is a type of machine learning algorithm that allows an agent to decide the best next action based on its current state by learning behaviors that will maximize a reward.

Reinforcement algorithms usually learn optimal actions through trial and error. Imagine, for example, a video game in which the player needs to move to certain places at certain times to earn points. A reinforcement algorithm playing that game would start by moving randomly but, over time through trial and error, it would learn where and when it needed to move the in-game character to maximize its point total.

Quantifying the Popularity of Machine Learning Algorithms

Where did we get these ten algorithms? Any such list will be inherently subjective. Studies such as these have quantified the 10 most popular data mining algorithms, but they’re still relying on the subjective responses of survey responses, usually advanced academic practitioners. For example, in the study linked above, the persons polled were the winners of the ACM KDD Innovation Award, the IEEE ICDM Research Contributions Award; the Program Committee members of the KDD ’06, ICDM ’06, and SDM ’06; and the 145 attendees of the ICDM ’06.

The top 10 algorithms listed in this post are chosen with machine learning beginners in mind. They are are primarily algorithms I learned from the ‘Data Warehousing and Mining’ (DWM) course during my Bachelor’s degree in Computer Engineering at the University of Mumbai. I have included the last 2 algorithms (ensemble methods) particularly because they are frequently used to win Kaggle competitions.

Without Further Ado, The Top 10 Machine Learning Algorithms for Beginners:

1. Linear Regression

In machine learning, we have a set of input variables (x) that are used to determine an output variable (y). A relationship exists between the input variables and the output variable. The goal of ML is to quantify this relationship.

Figure 1: Linear Regression is represented as a line in the form of y = a + bx. Source

In Linear Regression, the relationship between the input variables (x) and output variable (y) is expressed as an equation of the form y = a + bx. Thus, the goal of linear regression is to find out the values of coefficients a and b. Here, a is the intercept and b is the slope of the line.

Figure 1 shows the plotted x and y values for a data set. The goal is to fit a line that is nearest to most of the points. This would reduce the distance (‘error’) between the y value of a data point and the line.

2. Logistic Regression

Linear regression predictions are continuous values (i.e., rainfall in cm), logistic regression predictions are discrete values (i.e., whether a student passed/failed) after applying a transformation function.

Logistic regression is best suited for binary classification: data sets where y = 0 or 1, where 1 denotes the default class. For example, in predicting whether an event will occur or not, there are only two possibilities: that it occurs (which we denote as 1) or that it does not (0). So if we were predicting whether a patient was sick, we would label sick patients using the value of 1 in our data set.

Logistic regression is named after the transformation function it uses, which is called the logistic function h(x)= 1/ (1 + ex). This forms an S-shaped curve.

In logistic regression, the output takes the form of probabilities of the default class (unlike linear regression, where the output is directly produced). As it is a probability, the output lies in the range of 0-1. So, for example, if we’re trying to predict whether patients are sick, we already know that sick patients are denoted as 1, so if our algorithm assigns the score of 0.98 to a patient, it thinks that patient is quite likely to be sick.

This output (y-value) is generated by log transforming the x-value, using the logistic function h(x)= 1/ (1 + e^ -x) . A threshold is then applied to force this probability into a binary classification.

Figure 2: Logistic Regression to determine if a tumor is malignant or benign. Classified as malignant if the probability h(x)>= 0.5. Source

In Figure 2, to determine whether a tumor is malignant or not, the default variable is y = 1 (tumor = malignant). The x variable could be a measurement of the tumor, such as the size of the tumor. As shown in the figure, the logistic function transforms the x-value of the various instances of the data set, into the range of 0 to 1. If the probability crosses the threshold of 0.5 (shown by the horizontal line), the tumor is classified as malignant.

The logistic regression equation P(x) = e ^ (b0 +b1x) / (1 + e(b0 + b1x)) can be transformed into ln(p(x) / 1-p(x)) = b0 + b1x.

The goal of logistic regression is to use the training data to find the values of coefficients b0 and b1 such that it will minimize the error between the predicted outcome and the actual outcome. These coefficients are estimated using the technique of Maximum Likelihood Estimation.


Classification and Regression Trees (CART) are one implementation of Decision Trees.

The non-terminal nodes of Classification and Regression Trees are the root node and the internal node. The terminal nodes are the leaf nodes. Each non-terminal node represents a single input variable (x) and a splitting point on that variable; the leaf nodes represent the output variable (y). The model is used as follows to make predictions: walk the splits of the tree to arrive at a leaf node and output the value present at the leaf node.

The decision tree in Figure 3 below classifies whether a person will buy a sports car or a minivan depending on their age and marital status. If the person is over 30 years and is not married, we walk the tree as follows : ‘over 30 years?’ -> yes -> ’married?’ -> no. Hence, the model outputs a sports car.

Figure 3: Parts of a decision tree. Source

4. Naïve Bayes

To calculate the probability that an event will occur, given that another event has already occurred, we use Bayes’s Theorem. To calculate the probability of hypothesis(h) being true, given our prior knowledge(d), we use Bayes’s Theorem as follows:P(h|d)= (P(d|h) P(h)) / P(d)


  • P(h|d) = Posterior probability. The probability of hypothesis h being true, given the data d, where P(h|d)= P(d1| h) P(d2| h)….P(dn| h) P(d)
  • P(d|h) = Likelihood. The probability of data d given that the hypothesis h was true.
  • P(h) = Class prior probability. The probability of hypothesis h being true (irrespective of the data)
  • P(d) = Predictor prior probability. Probability of the data (irrespective of the hypothesis)

This algorithm is called ‘naive’ because it assumes that all the variables are independent of each other, which is a naive assumption to make in real-world examples.Naive-Bayes

Figure 4: Using Naive Bayes to predict the status of ‘play’ using the variable ‘weather’.

Using Figure 4 as an example, what is the outcome if weather = ‘sunny’?

To determine the outcome play = ‘yes’ or ‘no’ given the value of variable weather = ‘sunny’, calculate P(yes|sunny) and P(no|sunny) and choose the outcome with higher probability.

->P(yes|sunny)= (P(sunny|yes) * P(yes)) / P(sunny) = (3/9 * 9/14 ) / (5/14) = 0.60

-> P(no|sunny)= (P(sunny|no) * P(no)) / P(sunny) = (2/5 * 5/14 ) / (5/14) = 0.40

Thus, if the weather = ‘sunny’, the outcome is play = ‘yes’.

5. KNN

The K-Nearest Neighbors algorithm uses the entire data set as the training set, rather than splitting the data set into a training set and test set.

When an outcome is required for a new data instance, the KNN algorithm goes through the entire data set to find the k-nearest instances to the new instance, or the k number of instances most similar to the new record, and then outputs the mean of the outcomes (for a regression problem) or the mode (most frequent class) for a classification problem. The value of k is user-specified.

The similarity between instances is calculated using measures such as Euclidean distance and Hamming distance.

Unsupervised learning algorithms

6. Apriori

The Apriori algorithm is used in a transactional database to mine frequent item sets and then generate association rules. It is popularly used in market basket analysis, where one checks for combinations of products that frequently co-occur in the database. In general, we write the association rule for ‘if a person purchases item X, then he purchases item Y’ as : X -> Y.

Example: if a person purchases milk and sugar, then she is likely to purchase coffee powder. This could be written in the form of an association rule as: {milk,sugar} -> coffee powder. Association rules are generated after crossing the threshold for support and confidence.Formulae-for-support

Figure 5: Formulae for support, confidence and lift for the association rule X->Y.

The Support measure helps prune the number of candidate item sets to be considered during frequent item set generation. This support measure is guided by the Apriori principle. The Apriori principle states that if an itemset is frequent, then all of its subsets must also be frequent.

7. K-means

K-means is an iterative algorithm that groups similar data into clusters.It calculates the centroids of k clusters and assigns a data point to that cluster having least distance between its centroid and the data point.k-means-algorithm

Figure 6: Steps of the K-means algorithm. Source

Here’s how it works:

We start by choosing a value of k. Here, let us say k = 3. Then, we randomly assign each data point to any of the 3 clusters. Compute cluster centroid for each of the clusters. The red, blue and green stars denote the centroids for each of the 3 clusters.

Next, reassign each point to the closest cluster centroid. In the figure above, the upper 5 points got assigned to the cluster with the blue centroid. Follow the same procedure to assign points to the clusters containing the red and green centroids.

Then, calculate centroids for the new clusters. The old centroids are gray stars; the new centroids are the red, green, and blue stars.

Finally, repeat steps 2-3 until there is no switching of points from one cluster to another. Once there is no switching for 2 consecutive steps, exit the K-means algorithm.

8. PCA

Principal Component Analysis (PCA) is used to make data easy to explore and visualize by reducing the number of variables. This is done by capturing the maximum variance in the data into a new coordinate system with axes called ‘principal components’.

Each component is a linear combination of the original variables and is orthogonal to one another. Orthogonality between components indicates that the correlation between these components is zero.

The first principal component captures the direction of the maximum variability in the data. The second principal component captures the remaining variance in the data but has variables uncorrelated with the first component. Similarly, all successive principal components (PC3, PC4 and so on) capture the remaining variance while being uncorrelated with the previous component.PCA

Figure 7: The 3 original variables (genes) are reduced to 2 new variables termed principal components (PC’s). Source

Ensemble learning techniques:

Ensembling means combining the results of multiple learners (classifiers) for improved results, by voting or averaging. Voting is used during classification and averaging is used during regression. The idea is that ensembles of learners perform better than single learners.

There are 3 types of ensembling algorithms: Bagging, Boosting and Stacking. We are not going to cover ‘stacking’ here, but if you’d like a detailed explanation of it, here’s a solid introduction from Kaggle.

9. Bagging with Random Forests

The first step in bagging is to create multiple models with data sets created using the Bootstrap Sampling method. In Bootstrap Sampling, each generated training set is composed of random subsamples from the original data set.

Each of these training sets is of the same size as the original data set, but some records repeat multiple times and some records do not appear at all. Then, the entire original data set is used as the test set. Thus, if the size of the original data set is N, then the size of each generated training set is also N, with the number of unique records being about (2N/3); the size of the test set is also N.

The second step in bagging is to create multiple models by using the same algorithm on the different generated training sets.

This is where Random Forests enter into it. Unlike a decision tree, where each node is split on the best feature that minimizes error, in Random Forests, we choose a random selection of features for constructing the best split. The reason for randomness is: even with bagging, when decision trees choose the best feature to split on, they end up with similar structure and correlated predictions. But bagging after splitting on a random subset of features means less correlation among predictions from subtrees.

The number of features to be searched at each split point is specified as a parameter to the Random Forest algorithm.

Thus, in bagging with Random Forest, each tree is constructed using a random sample of records and each split is constructed using a random sample of predictors.

10. Boosting with AdaBoost

Adaboost stands for Adaptive Boosting. Bagging is a parallel ensemble because each model is built independently. On the other hand, boosting is a sequential ensemble where each model is built based on correcting the misclassifications of the previous model.

Bagging mostly involves ‘simple voting’, where each classifier votes to obtain a final outcome– one that is determined by the majority of the parallel models; boosting involves ‘weighted voting’, where each classifier votes to obtain a final outcome which is determined by the majority– but the sequential models were built by assigning greater weights to misclassified instances of the previous models.Adaboost

Figure 9: Adaboost for a decision tree. Source

In Figure 9, steps 1, 2, 3 involve a weak learner called a decision stump (a 1-level decision tree making a prediction based on the value of only 1 input feature; a decision tree with its root immediately connected to its leaves).

The process of constructing weak learners continues until a user-defined number of weak learners has been constructed or until there is no further improvement while training. Step 4 combines the 3 decision stumps of the previous models (and thus has 3 splitting rules in the decision tree).

First, start with one decision tree stump to make a decision on one input variable.

The size of the data points show that we have applied equal weights to classify them as a circle or triangle. The decision stump has generated a horizontal line in the top half to classify these points. We can see that there are two circles incorrectly predicted as triangles. Hence, we will assign higher weights to these two circles and apply another decision stump.

Second, move to another decision tree stump to make a decision on another input variable.

We observe that the size of the two misclassified circles from the previous step is larger than the remaining points. Now, the second decision stump will try to predict these two circles correctly.

As a result of assigning higher weights, these two circles have been correctly classified by the vertical line on the left. But this has now resulted in misclassifying the three circles at the top. Hence, we will assign higher weights to these three circles at the top and apply another decision stump.

Third, train another decision tree stump to make a decision on another input variable.

The three misclassified circles from the previous step are larger than the rest of the data points. Now, a vertical line to the right has been generated to classify the circles and triangles.

Fourth, Combine the decision stumps.

We have combined the separators from the 3 previous models and observe that the complex rule from this model classifies data points correctly as compared to any of the individual weak learners.


To recap, we have covered some of the the most important machine learning algorithms for data science:

  • 5 supervised learning techniques- Linear Regression, Logistic Regression, CART, Naïve Bayes, KNN.
  • 3 unsupervised learning techniques- Apriori, K-means, PCA.
  • 2 ensembling techniques- Bagging with Random Forests, Boosting with XGBoost.

Editor’s note: This was originally posted on KDNuggets, and has been reposted with perlesson. Author Reena Shaw is a developer and a data science journalist.

Commonly used Machine Learning Algorithms (with Python and R Codes)


Google’s self-driving cars and robots get a lot of press, but the company’s real future is in machine learning, the technology that enables computers to get smarter and more personal.– Eric Schmidt (Google Chairman)

We are probably living in the most defining period of human history. The period when computing moved from large mainframes to PCs to the cloud. But what makes it defining is not what has happened but what is coming our way in years to come. What makes this period exciting and enthralling for someone like me is the democratization of the various tools, techniques, and machine learning algorithms that followed the boost in computing. Welcome to the world of data science!

Today, as a data scientist, I can build data-crunching machines with complex algorithms for a few dollars per hour. But reaching here wasn’t easy! I had my dark days and nights.

Learning Objectives

  • Major focus on commonly used machine learning techniques and algorithms.
  • Algorithms covered – Linear regression, logistic regression, Naive Bayes, kNN, Random forest, etc.
  • Learn both theory and implementation of the machine learning algorithms in R and python.

Are you a beginner looking for a place to start your data science journey and learn machine learning models? Presenting a list of comprehensive courses, full of knowledge and data science learning, curated just for you to learn data science (using Python) from scratch:

  • Machine Learning Certification Course for Beginners
  • Introduction to Data Science
  • Certified AI & ML Blackbelt+ Program

Table of Contents

Who Can Benefit the Most From This Guide?

What I am giving out today is probably the most valuable guide I have ever created.

The idea behind creating this guide is to simplify the journey of aspiring data scientists and machine learning (which is part of artificial intelligence) enthusiasts across the world. Through this guide, I will enable you to work on machine-learning problems and gain from experience. I am providing a high-level understanding of various machine learning algorithms along with R & Python codes to run them. These should be sufficient to get your hands dirty. You can also check out our Machine Learning Course.

Essentials of machine learning algorithms with implementation in R and Python

I have deliberately skipped the statistics behind these techniques and artificial neural networks, as you don’t need to understand them initially. So, if you are looking for a statistical understanding of these algorithms, you should look elsewhere. But, if you want to equip yourself to start building a machine learning project, you are in for a treat.

3 Types of Machine Learning Algorithms

Supervised Learning Algorithms

How it works: This algorithm consists of a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps input data to desired outputs. The training process continues until the model achieves the desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.

Unsupervised Learning Algorithms

How it works: In this algorithm, we do not have any target or outcome variable to predict / estimate (which is called unlabelled data). It is used for recommendation systems or clustering populations in different groups. clustering algorithms are widely used for segmenting customers into different groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means clustering.

Reinforcement Learning Algorithms

How it works: Using this algorithm, the machine is trained to make specific decisions. The machine is exposed to an environment where it trains itself continually using trial and error. This machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning: Markov Decision Process

List of Top 10 Common Machine Learning Algorithms

Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

  1. Linear Regression
  2. Logistic Regression
  3. Decision Tree
  4. SVM
  5. Naive Bayes
  6. kNN
  7. K-Means
  8. Random Forest
  9. Dimensionality Reduction Algorithms
  10. Gradient Boosting algorithms
    1. GBM
    2. XGBoost
    3. LightGBM
    4. CatBoost

Linear Regression

It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on a continuous variable(s). Here, we establish the relationship between independent and dependent variables by fitting the best line. This best-fit line is known as the regression line and is represented by a linear equation Y= a*X + b.

The best way to understand linear regression is to relive this experience of childhood. Let us say you ask a child in fifth grade to arrange people in his class by increasing the order of weight without asking them their weights! What do you think the child will do? He/she would likely look (visually analyze) at the height and build of people and arrange them using a combination of these visible parameters. This is linear regression in real life! The child has actually figured out that height and build would be correlated to weight by a relationship, which looks like the equation above.

In this equation:

  • Y – Dependent Variable
  • a – Slope
  • X – Independent variable
  • b – Intercept

These coefficients a and b are derived based on minimizing the sum of the squared difference of distance between data points and the regression line.

Look at the below example. Here we have identified the best-fit line having linear equation y=0.2811x+13.9. Now using this equation, we can find the weight, knowing the height of a person.

Linear_Regression | machine learning algorithms

Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable. And, Multiple Linear Regression(as the name suggests) is characterized by multiple (more than 1) independent variables. While finding the best-fit line, you can fit a polynomial or curvilinear regression. And these are known as polynomial or curvilinear regression.

Here’s a coding window to try out your hand and build your own linear regression model in Python:

R Code:

#Load Train and Test datasets
#Identify feature and response variable(s) and values must be numeric and numpy arrays
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train,y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
#Predict Output
predicted= predict(linear,x_test)

Logistic Regression

Don’t get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values ( Binary values like 0/1, yes/no, true/false ) based on a given set of independent variable(s). In simple words, it predicts the probability of the occurrence of an event by fitting data to a logistic function. Hence, it is also known as logit regression. Since it predicts the probability, its output values lie between 0 and 1 (as expected).

Again, let us try and understand this through a simple example.

Let’s say your friend gives you a puzzle to solve. There are only 2 outcome scenarios – either you solve it, or you don’t. Now imagine that you are being given a wide range of puzzles/quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this – if you are given a trigonometry-based tenth-grade problem, you are 70% likely to solve it. On the other hand, if it is a grade fifth history question, the probability of getting an answer is only 30%. This is what Logistic Regression provides you.

Coming to the math, the log odds of the outcome are modeled as a linear combination of the predictor variables.

odds= p/ (1-p) = probability of event occurrence / probability of not event occurrence
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3....+bkXk

Above, p is the probability of the presence of the characteristic of interest. It chooses parameters that maximize the likelihood of observing the sample values rather than that minimize the sum of squared errors (like in ordinary regression).

Now, you may ask, why take a log? For the sake of simplicity, let’s just say that this is one of the best mathematical ways to replicate a step function. I can go into more details, but that will beat the purpose of this article.

Logistic_Regression | machine learning algorithms

Build your own logistic regression model in Python here and check the accuracy:

R Code:

x <- cbind(x_train,y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x,family='binomial')
#Predict Output
predicted= predict(logistic,x_test)


There are many different steps that could be tried in order to improve the model:

  • including interaction terms
  • removing features
  • regularization techniques
  • using a non-linear model

Decision Tree

This is one of my favorite algorithms, and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes/ independent variables to make as distinct groups as possible. For more details, you can read Decision Tree Simplified.

Decision tree sample | machine learning algorithms

Source: statsexchange

In the image above, you can see that population is classified into four different groups based on multiple attributes to identify ‘if they will play or not’. To split the population into different heterogeneous groups, it uses various techniques like Gini, Information Gain, Chi-square, and entropy.

The best way to understand how the decision tree works, is to play Jezzball – a classic game from Microsoft (image below). Essentially, you have a room with moving walls and you need to create walls such that the maximum area gets cleared off without the balls.

Jezzball game | machine learning algorithms

So, every time you split the room with a wall, you are trying to create 2 different populations within the same room. Decision trees work in a very similar fashion by dividing a population into as different groups as possible.

More: Simplified Version of Decision Tree Algorithms

Let’s get our hands dirty and code our own decision tree in Python!

R Code:

x <- cbind(x_train,y_train)
# grow tree 
fit <- rpart(y_train ~ ., data = x,method="class")
#Predict Output 
predicted= predict(fit,x_test)

SVM (Support Vector Machine)

It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.

For example, if we only had two features like the Height and Hair length of an individual, we’d first plot these two variables in two-dimensional space where each point has two coordinates (these co-ordinates are known as Support Vectors)

SVM1 | machine learning algorithms

Now, we will find some lines that split the data between the two differently classified groups of data. This will be the line such that the distances from the closest point in each of the two groups will be the farthest away. If there are more variables, a hyperplane is used to separate the classes.

SVM2 | machine learning algorithms

In the example shown above, the line which splits the data into two differently classified groups is the black line since the two closest points are the farthest apart from the line. This line is our classifier. Then, depending on where the testing data lands on either side of the line, that’s what class we can classify the new data as.

More: Simplified Version of Support Vector Machine
Think of this algorithm as playing JezzBall in n-dimensional space. The tweaks in the game are:

  • You can draw lines/planes at any angle (rather than just horizontal or vertical as in the classic game)
  • The objective of the game is to segregate balls of different colors in different rooms.
  • And the balls are not moving.

Try your hand and design an SVM model in Python through this coding window:

R Code:

x <- cbind(x_train,y_train)
# Fitting model
fit <-svm(y_train ~ ., data = x)
#Predict Output 
predicted= predict(fit,x_test)

Naive Bayes

It is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.

The Naive Bayesian model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x), and P(x|c). Look at the equation below:

Bayes_rule | machine learning algorithms


  • P(c|x) is the posterior probability of class (target) given predictor (attribute).
  • P(c) is the prior probability of the class.
  • P(x|c) is the likelihood which is the probability of the predictor given the class.
  • P(x) is the prior probability of the predictor.

Example: Let’s understand it using an example. Below is a training data set of weather and the corresponding target variable, ‘Play.’ Now, we need to classify whether players will play or not based on weather conditions. Let’s follow the below steps to perform it.

Step 1: Convert the data set to a frequency table.

Step 2: Create a Likelihood table by finding the probabilities like Overcast probability = 0.29 and probability of playing is 0.64.

Bayes_4 | machine learning algorithms

Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.

Problem: Players will pay if the weather is sunny. Is this statement correct?

We can solve it using above discussed method, so P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)

Here we have P (Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes)= 9/14 = 0.64

Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has a higher probability.

Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.

Code for a Naive Bayes classification model in Python:

R Code:

x <- cbind(x_train,y_train)
# Fitting model
fit <-naiveBayes(y_train ~ ., data = x)
#Predict Output 
predicted= predict(fit,x_test)

kNN (k- Nearest Neighbors)

It can be used for both classification and regression problems. However, it is more widely used in classification problems in the industry. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. The case assigned to the class is most common amongst its K nearest neighbors measured by a distance function.

These distance functions can be Euclidean, Manhattan, Minkowski, and Hamming distances. The first three functions are used for continuous functions, and the fourth one (Hamming) for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge while performing kNN modeling.

More: Introduction to k-nearest neighbors: Simplified.

KNN | machine learning algorithms

KNN can easily be mapped to our real lives. If you want to learn about a person with whom you have no information, you might like to find out about his close friends and the circles he moves in and gain access to his/her information!

Things to consider before selecting kNN:

  • KNN is computationally expensive
  • Variables should be normalized else higher range variables can bias it
  • Works on pre-processing stage more before going for kNN like an outlier, noise removal

Python Code:

R Code:

x <- cbind(x_train,y_train)
# Fitting model
fit <-knn(y_train ~ ., data = x,k=5)
#Predict Output 
predicted= predict(fit,x_test)


It is a type of unsupervised algorithm which solves the clustering problem. Its procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous and heterogeneous to peer groups.

Remember figuring out shapes from ink blots? k means is somewhat similar to this activity. You look at the shape and spread to decipher how many different clusters/populations are present!

How K-means forms cluster:

  1. K-means picks k number of points for each cluster known as centroids.
  2. Each data point forms a cluster with the closest centroids, i.e., k clusters.
  3. Finds the centroid of each cluster based on existing cluster members. Here we have new centroids.
  4. As we have new centroids, repeat steps 2 and 3. Find the closest distance for each data point from new centroids and get associated with new k-clusters. Repeat this process until convergence occurs, i.e., centroids do not change.

How to determine the value of K:

In K-means, we have clusters, and each cluster has its own centroid. The sum of the square of the difference between the centroid and the data points within a cluster constitutes the sum of the square value for that cluster. Also, when the sum of square values for all the clusters is added, it becomes a total within the sum of the square value for the cluster solution.

We know that as the number of clusters increases, this value keeps on decreasing, but if you plot the result, you may see that the sum of squared distance decreases sharply up to some value of k and then much more slowly after that. Here, we can find the optimum number of clusters.

Kmeans | machine learning algorithms

Python Code:

R Code:

fit <- kmeans(X, 3) # 5 cluster solution

Random Forest

Random Forest is a trademarked term for an ensemble learning of decision trees. In Random Forest, we’ve got a collection of decision trees (also known as “Forest”). To classify a new object based on attributes, each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is planted & grown as follows:

  1. If the number of cases in the training set is N, then a sample of N cases is taken at random but with replacement. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M, and the best split on this m is used to split the node. The value of m is held constant during the forest growth.
  3. Each tree is grown to the largest extent possible. There is no pruning.

For more details on this algorithm, compared with the decision tree and tuning model parameters, I would suggest you read these articles:

  1. Introduction to Random forest – Simplified
  2. Comparing a CART model to Random Forest (Part 1)
  3. Comparing a Random Forest to a CART model (Part 2)
  4. Tuning the parameters of your Random Forest model

Python Code:

R Code:

x <- cbind(x_train,y_train)
# Fitting model
fit <- randomForest(Species ~ ., x,ntree=500)
#Predict Output 
predicted= predict(fit,x_test)

Dimensionality Reduction Algorithms

In the last 4-5 years, there has been an exponential increase in data capturing at every possible stage. Corporates/ Government Agencies/ Research organizations are not only coming up with new sources, but also they are capturing data in great detail.

For example, E-commerce companies are capturing more details about customers like their demographics, web crawling history, what they like or dislike, purchase history, feedback, and many others to give them personalized attention more than your nearest grocery shopkeeper.

As data scientists, the data we are offered also consists of many features, this sounds good for building a good robust model, but there is a challenge. How’d you identify highly significant variable(s) out of 1000 or 2000? In such cases, the dimensionality reduction algorithm helps us, along with various other algorithms like Decision Tree, Random Forest, PCA (principal component analysis), Factor Analysis, Identity-based on the correlation matrix, missing value ratio, and others.

To know more about these algorithms, you can read “Beginners Guide To Learn Dimension Reduction Techniques“.

Python Code:

R Code:

pca <- princomp(train, cor = TRUE)
train_reduced  <- predict(pca,train)
test_reduced  <- predict(pca,test)

Gradient Boosting Algorithms

Now, let’s look at the 4 most commonly used gradient boosting algorithms.


GBM is a boosting algorithm used when we deal with plenty of data to make a prediction with high prediction power. Boosting is actually an ensemble of learning algorithms that combines the prediction of several base estimators in order to improve robustness over a single estimator. It combines multiple weak or average predictors to build a strong predictor. These boosting algorithms always work well in data science competitions like Kaggle, AV Hackathon, and CrowdAnalytix.

More: Know about Boosting algorithms in detail
Python Code:

R Code:

x <- cbind(x_train,y_train)
# Fitting model
fitControl <- trainControl( method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y ~ ., data = x, method = "gbm", trControl = fitControl,verbose = FALSE)
predicted= predict(fit,x_test,type= "prob")[,2]

GradientBoostingClassifier and Random Forest are two different boosting tree classifiers, and often people ask about the difference between these two algorithms.


Another classic gradient-boosting algorithm that’s known to be the decisive choice between winning and losing in some Kaggle competitions is the XGBoost. It has an immensely high predictive power, making it the best choice for accuracy in events. It possesses both a linear model and the tree learning algorithm, making the algorithm almost 10x faster than existing gradient booster techniques.

One of the most interesting things about the XGBoost is that it is also called a regularized boosting technique. This helps to reduce overfit modeling and has massive support for a range of languages such as Scala, Java, R, Python, Julia, and C++.

The support includes various objective functions, including regression, classification, and ranking. Supports distributed and widespread training on many machines that encompass GCE, AWS, Azure, and Yarn clusters. XGBoost can also be integrated with Spark, Flink, and other cloud dataflow systems with built-in cross-validation at each iteration of the boosting process.

Python Code:

R Code:


x <- cbind(x_train,y_train)

# Fitting model

TrainControl <- trainControl( method = "repeatedcv", number = 10, repeats = 4)

model<- train(y ~ ., data = x, method = "xgbLinear", trControl = TrainControl,verbose = FALSE)


model<- train(y ~ ., data = x, method = "xgbTree", trControl = TrainControl,verbose = FALSE)

predicted <- predict(model, x_test)


LightGBM is a gradient-boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:

  • Faster training speed and higher efficiency
  • Lower memory usage
  • Better accuracy
  • Parallel and GPU learning supported
  • Capable of handling large-scale data

The framework is a fast and high-performance gradient-boosting one based on decision tree algorithms used for ranking, classification, and many other machine-learning tasks. It was developed under the Distributed Machine Learning Toolkit Project of Microsoft.

Since the LightGBM is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise. So when growing on the same leaf node in Light GBM, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, resulting in much better accuracy, which any existing boosting algorithms can rarely achieve.

Also, it is surprisingly very fast, hence the word ‘Light.’

Python Code:

data = np.random.rand(500, 10) # 500 entities, each contains 10 features
label = np.random.randint(2, size=500) # binary target

train_data = lgb.Dataset(data, label=label)
test_data = train_data.create_valid('test.svm')

param = {'num_leaves':31, 'num_trees':100, 'objective':'binary'}
param['metric'] = 'auc'

num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[test_data])


# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
ypred = bst.predict(data)

R Code:


num_iterations <- 100
config <- list(objective = "binary",  metric="binary_logloss,auc", learning_rate = 0.1, num_leaves = 63, tree_learner = "serial", feature_fraction = 0.8, bagging_freq = 5, bagging_fraction = 0.8, min_data_in_leaf = 50, min_sum_hessian_in_leaf = 5.0)

#Create data handle and booster <-, "label", y)

handle.booster <- lgbm.booster.create(, lapply(config, as.character))

#Train for num_iterations iterations and eval every 5 steps

lgbm.booster.train(handle.booster, num_iterations, 5)

pred <- lgbm.booster.predict(handle.booster, x.test)

#Test accuracy
sum(y.test == (y.pred > 0.5)) / length(y.test)

#Save model (can be loaded again via lgbm.booster.load(filename)), filename = "/tmp/model.txt")

If you’re familiar with the Caret package in R, this is another way of implementing the LightGBM.


model <-caretModel.LGBM()

fit <- train(Species ~ ., data = iris, method=model, verbosity = 0)
y.pred <- predict(fit, iris[,1:4])

model.sparse <- caretModel.LGBM.sparse()

#Generate a sparse matrix
mat <- Matrix(as.matrix(iris[,1:4]), sparse = T)
fit <- train(data.frame(idx = 1:nrow(iris)), iris$Species, method = model.sparse, matrix = mat, verbosity = 0)


CatBoost is one of open-sourced machine learning algorithms from Yandex. It can easily integrate with deep learning frameworks like Google’s TensorFlow and Apple’s Core ML. The best part about CatBoost is that it does not require extensive data training like other ML models and can work on a variety of data formats, not undermining how robust it can be.

Catboost can automatically deal with categorical variables without showing the type conversion error, which helps you to focus on tuning your model better rather than sorting out trivial errors. Make sure you handle missing data well before you proceed with the implementation.

Python Code:

import pandas as pd
import numpy as np

from catboost import CatBoostRegressor

#Read training and testing files
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

#Imputing missing values for both train and test
train.fillna(-999, inplace=True)

#Creating a training set for modeling and validation set to check model performance
X = train.drop(['Item_Outlet_Sales'], axis=1)
y = train.Item_Outlet_Sales

from sklearn.model_selection import train_test_split

X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.7, random_state=1234)
categorical_features_indices = np.where(X.dtypes != np.float)[0]

#importing library and building model
from catboost import CatBoostRegressormodel=CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE'), y_train,cat_features=categorical_features_indices,eval_set=(X_validation, y_validation),plot=True)

submission = pd.DataFrame()

submission['Item_Identifier'] = test['Item_Identifier']
submission['Outlet_Identifier'] = test['Outlet_Identifier']
submission['Item_Outlet_Sales'] = model.predict(test)

R Code:





tt <- titanic::titanic_train[complete.cases(titanic::titanic_train),]

data <-, stringsAsFactors = TRUE)

drop_columns = c("PassengerId", "Survived", "Name", "Ticket", "Cabin")

x <- data[,!(names(data) %in% drop_columns)]y <- data[,c("Survived")]

fit_control <- trainControl(method = "cv", number = 4,classProbs = TRUE)

grid <- expand.grid(depth = c(4, 6, 8),learning_rate = 0.1,iterations = 100, l2_leaf_reg = 1e-3,            rsm = 0.95, border_count = 64)

report <- train(x, as.factor(make.names(y)),method = catboost.caret,verbose = TRUE, preProc = NULL,tuneGrid = grid, trControl = fit_control)


importance <- varImp(report, scale = FALSE)


Practice Problems

Now, it’s time to take the plunge and actually play with some other real-world datasets. So are you ready to take on the challenge? Accelerate your data science journey with the following practice problems:


By now, I am sure you would have an idea of commonly used machine learning algorithms. My sole intention behind writing this article and providing the codes in R and Python is to get you started right away. If you are keen to master machine learning algorithms, start right away. Take up problems, develop a physical understanding of the process, apply these codes, and watch the fun!

Key Takeaways

  • We are now familiar with some of the most common ML algorithms used in the industry.
  • We’ve covered the advantages and disadvantages of various ML algorithms.
  • We’ve also learned the basic implementation details in R and Python languages.

Frequently Asked Questions

Q1. Which algorithm is mostly used in machine learning?

A. While the suitable algorithm depends on the problem, gradient-boosted decision trees are mostly used to balance performance and interpretability.

Q2. What is the difference between supervised and unsupervised ML?

A. In the supervised learning model, the labels associated with the features are given. In unsupervised learning, no labels are provided for the model.

Q3. What are the main 3 types of ML models?

A. The 3 main types of ML models are based on Supervised Learning, Unsupervised Learning, and Reinforcement Learning.