Machine learning methods

What are machine learning methods?

Machine learning methods
Source: eduCBA

Regression

Regression methods fall within the category of supervised ML. They help to predict or explain a particular numerical value based on a set of prior data, for example predicting the price of a property based on previous pricing data for similar properties.

The simplest method is linear regression where we use the mathematical equation of the line (y = m * x + b) to model a data set. We train a linear regression model with many data pairs (x, y) by calculating the position and slope of a line that minimizes the total distance between all of the data points and the line. In other words, we calculate the slope (m) and the y-intercept (b) for a line that best approximates the observations in the data.

Regression techniques run the gamut from simple (like linear regression) to complex (like regularized linear regression, polynomial regression, decision trees and random forest regressions, neural nets, among others). But don’t get bogged down: start by studying simple linear regression, master the techniques, and move on from there.

Classification

Another class of supervised ML, classification methods predict or explain a class value. For example, they can help predict whether or not an online customer will buy a product. The output can be yes or no: buyer or not buyer. But classification methods aren’t limited to two classes. For example, a classification method could help to assess whether a given image contains a car or a truck. In this case, the output will be 3 different values: 1) the image contains a car, 2) the image contains a truck, or 3) the image contains neither a car nor a truck.

The simplest classification algorithm is logistic regression — which makes it sounds like a regression method, but it’s not. Logistic regression estimates the probability of an occurrence of an event based on one or more inputs.

Clustering

With clustering methods, we get into the category of unsupervised ML because their goal is to group or cluster observations that have similar characteristics. Clustering methods don’t use output information for training, but instead let the algorithm define the output. In clustering methods, we can only use visualizations to inspect the quality of the solution.

The most popular clustering method is K-Means, where “K” represents the number of clusters that the user chooses to create. 

As you explore clustering, you’ll encounter very useful algorithms such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Mean Shift Clustering, Agglomerative Hierarchical Clustering, Expectation–Maximization Clustering using Gaussian Mixture Models, among others.

Dimensionality Reduction

As the name suggests, we use dimensionality reduction to remove the least important information (sometime redundant columns) from a data set. In practice, I often see data sets with hundreds or even thousands of columns (also called features), so reducing the total number is vital. For instance, images can include thousands of pixels, not all of which matter to your analysis. Or when testing microchips within the manufacturing process, you might have thousands of measurements and tests applied to every chip, many of which provide redundant information. In these cases, you need dimensionality reduction algorithms to make the data set manageable.

The most popular dimensionality reduction method isPrincipal Component Analysis (PCA), which reduces the dimension of the feature space by finding new vectors that maximize the linear variation of the data. Another popular method is t-Stochastic Neighbor Embedding (t-SNE), which does non-linear dimensionality reduction. People typically use t-SNE for data visualization, but you can also use it for machine learning tasks like reducing the feature space and clustering, to mention just a few.

Ensemble Methods

Imagine you’ve decided to build a bicycle because you are not feeling happy with the options available in stores and online. You might begin by finding the best of each part you need. Once you assemble all these great parts, the resulting bike will outshine all the other options.

Ensemble methods use this same idea of combining several predictive models (supervised ML) to get higher quality predictions than each of the models could provide on its own. For example, the Random Forest algorithms is an ensemble method that combines many Decision Trees trained with different samples of the data sets. As a result, the quality of the predictions of a Random Forest is higher than the quality of the predictions estimated with a single Decision Tree.

Think of ensemble methods as a way to reduce the variance and bias of a single machine learning model. The great majority of top winners of Kaggle competitions use ensemble methods of some kind. The most popular ensemble algorithms are Random Forest, XGBoost and LightGBM.

Neural Networks and Deep Learning

In contrast to linear and logistic regressions which are considered linear models, the objective of neural networks is to capture non-linear patterns in data by adding layers of parameters to the model. 

In fact, the structure of neural networks is flexible enough to build our well-known linear and logistic regression. The term Deep learning comes from a neural net with many hidden layers (see next Figure) and encapsulates a wide variety of architectures.

It’s especially difficult to keep up with developments in deep learning, in part because the research and industry communities have doubled down on their deep learning efforts, spawning whole new methodologies every day.

In particular, deep learning techniques have been extremely successful in the areas of vision (image classification), text, audio and video. The most common software packages for deep learning are Tensorflow and PyTorch.

Transfer Learning

Transfer Learning refers to re-using part of a previously trained neural net and adapting it to a new but similar task. Specifically, once you train a neural net using data for a task, you can transfer a fraction of the trained layers and combine them with a few new layers that you can train using the data of the new task. By adding a few layers, the new neural net can learn and adapt quickly to the new task.

The main advantage of transfer learning is that you need less data to train the neural net, which is particularly important because training for deep learning algorithms is expensive in terms of both time and money (computational resources) — and of course it’s often very difficult to find enough labeled data for the training.

Transfer learning has become more and more popular and there are now many solid pre-trained models available for common deep learning tasks like image and text classification.

Reinforcement Learning

Reinforcement Learning is a machine learning method that helps an agent learn from experience. By recording actions and using a trial-and-error approach in a set environment, RL can maximize a cumulative reward. In our example, the mouse is the agent and the maze is the environment. The set of possible actions for the mouse are: move front, back, left or right. The reward is the cheese.

You can use RL when you have little to no historical data about a problem, because it doesn’t need information in advance (unlike traditional machine learning methods). In a RL framework, you learn from the data as you go. Not surprisingly, RL is especially successful with games, especially games of “perfect information” like chess and Go. With games, feedback from the agent and the environment comes quickly, allowing the model to learn fast. The downside of RL is that it can take a very long time to train if the problem is complex.

Natural Language Processing

Natural Language Processing (NLP) is not a machine learning method per se, but rather a widely used technique to prepare text for machine learning. Think of tons of text documents in a variety of formats (word, online blogs, ….). Most of these text documents will be full of typos, missing characters and other words that needed to be filtered out. At the moment, the most popular package for processing text is NLTK (Natural Language ToolKit), created by researchers at Stanford.

Word Embeddings

TFM and TFIDF are numerical representations of text documents that only consider frequency and weighted frequencies to represent text documents. By contrast, word embeddings can capture the context of a word in a document. With the word context, embeddings can quantify the similarity between words, which in turn allows us to do arithmetic with words.

Word representations allow finding similarities between words by computing the cosine similarity between the vector representation of two words. The cosine similarity measures the angle between two vectors.

We compute word embeddings using machine learning methods, but that’s often a pre-step to applying a machine learning algorithm on top. For instance, suppose we have access to the tweets of several thousand Twitter users. Also suppose that we know which of these Twitter users bought a house. To predict the probability of a new Twitter user buying a house, we can combine Word2Vec with a logistic regression.

You can train word embeddings yourself or get a pre-trained (transfer learning) set of word vectors. To download pre-trained word vectors in 157 different languages, take a look at FastText.

Machine Learning Methods

Introduction to Machine Learning Methods

Machine Learning Methods are used to make the system learn using methods like Supervised learning and Unsupervised Learning which are further classified in methods like Classification, Regression and Clustering. This selection of methods entirely depends on the type of dataset that is available to train the model, as the dataset can be labeled, unlabelled, large. There are various applications (like image classification, Predictive analysis, Spam detection) that uses these different machine learning methods.

How do Machines learn?

There are various methods to do that. Which method to follow completely depends on the problem statement. Depending on the dataset, and our problem, there are two different ways to go deeper. One is supervised learning and the other is unsupervised learning. The following chart explains the further classification of machine learning methods. We will discuss them one by one.

Take a look at the following chart!

Machine learning methods

Let’s understand what does Supervised Learning means.

Supervised Learning

As the name suggests, imagine a teacher or a supervisor helping you to learn. The same goes for machines. We train or teach the machine using data that is labeled.

Some of the coolest supervised learning applications are:

  • Sentiment analysis (Twitter, Facebook, Netflix, YouTube, etc)
  • Natural Language Processing
  • Image classification
  • Predictive analysis
  • Pattern recognition
  • Spam detection
  • Speech/Sequence processing

Now, supervised learning is further divided into classification and regression. Let’s, understand this.

Classification

Classification is the process of finding a model that helps to separate the data into different categorical classes. In this process, data is categorized under different labels according to some parameters given in input and then the labels are predicted for the data. Categorical means the output variable is a category, i.e red or black, spam or not spam, diabetic or non-diabetic, etc.

Classification models include Support vector machine(SVM),K-nearest neighbor(KNN),Naive Bayes etc.

a) Support vector machine classifier (SVM)

SVM is a supervised learning method that looks at the data and sorts it into one of two categories. I use a hyperplane to categorize the data. A linear discriminative classifier attempts to draw a straight line separating the two sets of data and thereby create a model for classification. It simply tries to find a line or curve (in two dimensions) or a manifold (in multiple dimensions) that divides the classes from each other.

Note: For multiclass classification SVM makes use of ‘one vs rest’, that means calculating different SVM for each class.

b) K-nearest neighbor classifier (KNN)

  • If you read carefully, the name itself suggests what the algorithm does. KNN considers the data points which are closer, are much more similar in terms of features and hence more likely to belong to the same class as the neighbor. For any new data point, the distance to all other data points is calculated and the class is decided based on K nearest neighbors. Yes, it may sound lame, but for some of the classification, it works like anything.
  • A data point is classified by the maximum number vote of its neighbors, then the data point is assigned to the class nearest among its k-neighbors.
  • In KNN, no learning of the model is required and all of the work happens at the time a prediction is requested. That’s why KNN is often referred to as a lazy learning algorithm.

c) Naive Bayes classifier

  • Naive Bayes is a machine learning algorithm that is highly recommended for text classification problems. It is based on Bayes’ probability theorem. These classifiers are called naive because they assume that features variables are independent of each other. That means, for example, we have a full sentence for input, then Naive Bayes assumes every word in a sentence is independent of the other ones. And then classify them accordingly. I know, it looks pretty naive, but it’s a great choice for text classification problems and it’s a popular choice for spam email classification.
  • It provides different types of Naive Bayes Algorithms like BernoulliNB, GaussianNB, MultinomialNB.
  • It considers all the features to be unrelated, so it cannot learn the relationship between features. For example, Let’s say, Varun likes to eat burgers, he also likes to eat French fries with coke. But he doesn’t like to eat a burger and a combination of French fries with coke together. Here, Naive Bayes can not learn the relation between two features but only learns individual feature importance only.

Now let’s move on to the other side of our supervised learning method, which is a regression.

Regression

Regression is the process of finding a model that helps to differentiate the data using continuous values. In this, the nature of the predicted data is ordered. Some of the most widely used regression models include Linear regression, Random forest(Decision trees), Neural networks.

Linear regression

  • One of the simplest approaches in supervised learning, which is useful in predicting the quantitative response.
  • Linear regression includes finding the best-fitting straight line through the points. The best-fitting line is called a regression line. The best fit line doesn’t exactly pass through all the data points but instead tries it’s best to get close to them.
  • It is the widely used algorithm for continuous data. However, it only focuses on the mean of the dependent variable and limits itself to a linear relationship.
  • Linear regression can be used for Time series, trend forecasting. It can predict future sales, based on the previous data.

Unsupervised Learning

  • Unsupervised learning is based on the approach that can be thought of as the absence of a teacher and therefore of absolute error measures. It’s useful when it’s required to learn clustering or grouping of elements. Elements can be grouped (clustered) according to their similarity.
  • In unsupervised learning, data is unlabeled, not categorized and the system’s algorithms act on the data without prior training. Unsupervised learning algorithms can perform more complex tasks than supervised learning algorithms.
  • Unsupervised learning includes clustering which can be done by using K means clustering, hierarchical, Gaussian mixture, hidden Markov model.

Unsupervised Learning applications are:

  1. Similarity detection
  2. Automatic labeling
  3. Object segmentation (such as Person, Animal, Films)

Clustering

  • Clustering is an unsupervised learning technique that is used for data analytics in many fields. The clustering algorithm comes handy when we want to gain detailed insights about our data.
  • A real-world example of clustering would be Netflix’s genre clusters, which are divided for different target customers including interests, demographics, lifestyles, etc. Now you can think about how useful clustering is when companies want to understand their customer base and target new potential customers.

a) K means Clustering

  • K means clustering algorithm tries to divide the given unknown data into clusters. It randomly selects ‘k’ clusters centroid, calculates the distance between data points and clusters centroid and then finally assigns the data point to cluster centroid whose distance is minimum of all cluster centroids.
  • In k-means, groups are defined by the closest centroid for every group. This centroid acts as ‘Brain’ of the algorithm, they acquire the data points which are closest to them and then add them to the clusters.

b) Hierarchical Clustering

Hierarchical clustering is nearly similar to that of normal clustering unless you want to build a hierarchy of clusters. This can come handy when you want to decide the number of clusters. For example, suppose you are creating groups of different items on the online grocery store. On the front home page, you want a few broad items and once you click on one of the items, specific categories, that is more specific clusters opens up.

Dimensionality reduction

Dimensionality reduction can be considered as compression of a file. It means, taking out the information which is not relevant. It reduces the complexity of data and tries to keep the meaningful data. For example, in image compression, we reduce the dimensionality of the space in which the image stays as it is without destroying too much of the meaningful content in the image.

PCA for Data Visualization

Principal component analysis (PCA) is a dimension reduction method that can be useful to visualize your data. PCA is used to compress higher dimensional data to lower-dimensional data, that is, we can use PCA to reduce a four-dimensional data into three or 2 dimensions so that we can visualize and get a better understanding of the data.