In this course you will learn NLP with vector spaces.
You will
Gain knowledge of
Sentiment analysis with logistic regression
Sentiment analysis with Naive Bayes
Vector space models
Machine translation and document search
Validate your knowledge by answering a quiz at the end of each lecture
Be able to complete the course in about 2 hours.
Syllabus
Sentiment analysis with logistic regression
Supervised ML
Feature extraction
Logistic regression
Sentiment analysis with Naive Bayes
Bayes' rule
Laplacian smoothing
Vector space models
Euclidean distance
Cosine similarity
PCA
Machine translation and document search
Word vectors
K-nearest neighbours
Approximating NN
Additional content
GPT-3
DALL-E
CLIP
Vector space model or term vector model is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers (such as index terms). It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.
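As a minimal sketch of the idea, the snippet below represents three short documents as term-count vectors over a shared vocabulary and compares them with cosine similarity. The sample sentences and the helper names `term_vector` and `cosine_similarity` are illustrative assumptions, not part of any particular retrieval system.

```python
import math
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [term_vector(doc, vocabulary) for doc in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # documents share several terms
sim_02 = cosine_similarity(vectors[0], vectors[2])  # documents share no terms
```

Documents that share terms point in similar directions and score close to 1, while documents with no terms in common score 0.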
Supervised learning (SL) is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way (see inductive bias). This statistical quality of an algorithm is measured through the so-called generalization error.
The parallel task in human and animal psychology is often referred to as concept learning.
Machine learning techniques comprise an array of computer-intensive methods that aim at discovering patterns in data using flexible, often nonparametric, methods for modeling and variable selection. These methods offer an expansion to the more traditional methods, such as OLS or logistic regression, which have been used by survey researchers and social scientists. Many of the machine learning methods do not require the distributional assumptions of the more traditional methods, and many do not require explicit model specification prior to estimation.
Machine learning methods are beginning to be used for various aspects of survey research including responsive/adaptive designs, data processing and nonresponse adjustments and weighting. This special issue aims to familiarize survey researchers and social scientists with the basic concepts in machine learning and highlights five common methods. Specifically, articles in this issue will offer an accessible introduction to: LASSO models, support vector machines, neural networks, and classification and regression trees and random forests. In addition to a detailed description, each article will highlight how the respective method is being used in survey research along with an application of the method to a common example.
The introductory article will provide an accessible introduction to some commonly used concepts and terms associated with machine learning modeling and evaluation. The introduction also provides a description of the data set that was used as the common application example for each of the five machine learning methods.
What are Machine Learning Methods?
Machine learning methods are generally flexible, nonparametric methods for making predictions or classifications from data. These methods are typically described by the algorithm that details how the predictions are made using the raw data and can allow for a larger number of predictors, referred to as high-dimensional data. These methods can often automatically detect nonlinearities in the relationships between independent and dependent variables and can identify interactions automatically. These methods can be applied to predict continuous outcomes, generally referred to as regression type problems, or to predict levels of a categorical variable, generally referred to as classification problems. Machine learning methods can also be used to group cases based on a collection of variables known for all the cases.
Types of Machine Learning Algorithms
Generally, machine learning techniques can be divided into two broad categories, supervised and unsupervised. The goal of supervised learning is to optimally predict a dependent variable (also referred to as “output,” “target,” “class,” or “label”), as a function of a range of independent variables (also referred to as “inputs,” “features,” or “attributes.”). A classical example of supervised machine learning with which survey and social scientists are familiar is ordinary least squares regression. Such a technique relies on a single (continuous) dependent variable and seeks to determine the best linear fit between this outcome and multiple independent variables. Unsupervised learning, on the other hand, is more complex, in that there is no prespecified dependent variable, and these methods focus on detecting patterns among all the variables of interest in a dataset. One of the most common unsupervised methods with which social scientists and market researchers might have some familiarity is hierarchical cluster analysis – also known as segmentation. In this case, the main interest is not on modeling an outcome based on multiple independent variables, as in regression, but rather on understanding if there are combinations of variables (e.g., demographics) that can segment or group sets of customers, respondents or members of a group, class, or city. The final output of this approach is the actual grouping of the cases within a data set, where the grouping is determined by the collection of variables available for the analysis.
Tuning Parameters for Machine Learning Methods
Unlike many traditional modeling techniques such as ordinary least squares regression, machine learning methods require a specification of hyperparameters, or tuning parameters before a final model and predictions can be obtained. These parameters are often estimated from the data prior to estimating the final model. It could be useful to think of these as settings or “knobs” on the “machine” prior to hitting the “start button” to generate the predictions. One of the simplest examples of a tuning parameter comes from K-means clustering. Prior to running a K-means clustering algorithm, the machine learning algorithm needs to know how many clusters it should produce in the end (i.e., K). The main point is that these tuning parameters are needed prior to computing final models and predictions. Many machine learning algorithms have only one such hyperparameter (e.g., K-means clustering, LASSO, tree-based models) and others require more than one (e.g., random forests, neural networks).
The Context for Machine Learning Methods: Explanation versus Prediction
Machine learning methods are algorithmic and focus on using data at hand to describe the data generating mechanism. In applying these more empirical methods in survey research, it is important to understand the distinction between models created and used for explanation versus prediction. Breiman (2001) refers to these two end goals as the two statistical modeling cultures, and Shmueli (2010) refers to them as two modeling paths. The first of these modeling paths consists of traditional methods or explanatory models that focus on explanation, while the second one consists of predictive models that focus on prediction of continuous outcomes or classification for categorical outcomes. While machine learning or algorithmic methods can be used to refine explanatory models, their most common application lies in the development of prediction or classification models. The goals and methods for constructing and evaluating models from each of these two paths overlap to some degree, but in many applications, there can be specific differences that should be understood to maximize their utility in both research and practice. We turn now to a brief overview of explanatory models and predictive models in an effort to elucidate some of the key distinctions in these approaches that are needed in order to understand how predictive models developed using machine learning methods are evaluated in practice.
A Recap of Explanatory Models
In many social sciences applications, a relevant underlying theoretical model posits a functional relationship between constructs and an outcome of interest. These constructs are then operationalized into variables that are then used in the explanatory model for exploration and hypotheses testing. For example, researchers who are looking to understand the adoption of new technologies might posit a path model that is informed by the underlying theoretical technology adoption lifecycle model (Rogers 1962). Taking one step beyond explanation, these models can also be used to make causal inferences about the nature of the relationships between the observed variables and the outcome of interest. Another common interest among survey researchers is understanding correlates of nonresponse as well as possible causal pathways of it. In fact, survey researchers have a long history of conducting nonresponse follow-up surveys to gather additional information thought to be related to survey participation, or in the causal pathway, that go beyond known auxiliary variables. An explanatory model can be constructed using all of the available information and then used to test various hypotheses about how the variables, or relationships among variables, impact survey participation. But this type of model may have very limited utility for predicting nonresponse as it contains variables not likely to be available from all sampled units prior to the survey.
Explanatory models are commonly used in research and practice to facilitate statistical inferences rather than to make predictions, per se. The underlying shape (e.g., linear, polynomial terms, nonlinear terms) and content of these models is often informed by the underlying theory, experience of the researcher, or prior literature. Well-constructed explanatory models are then used to investigate hypotheses related to the underlying theory as well as to explore relationships among the predictor variables and the outcome of interest. These models are constructed to maximize explanatory power (e.g., percentage of observed variance explained) and proper specification to minimize bias while also being attentive to parsimony. Hence, evaluation of these models focuses on goodness of fit; simplifications of the models are driven by evaluating the significance of the predictors and overall goodness of fit indices. The inclusion of important predictors in the final model is often quantified using effect size measures, confidence intervals, or p-values for estimated coefficients.
The Basics of Predictive Modeling
In contrast to explanatory models that explore relationships among observed variables or confirm hypotheses, prediction or classification models are constructed with the primary purpose of predicting or classifying continuous or categorical outcomes, respectively, for new cases not yet observed. Prediction for continuous numeric variables, also referred to as quantitative variables, is usually referred to as a regression problem, whereas prediction for categorical, qualitative variables is referred to as a classification problem. For example, in responsive survey designs, it is often useful to have an accurate classification of which sampled units are likely to respond to a survey and which are not. Within an online survey panel context, it might also be useful to know which respondents are likely to leave an item missing on a questionnaire and which respondents are not. Armed with these predicted classifications, researchers and practitioners can tailor the survey experience in an attempt to mitigate the negative consequences of nonresponse or item missingness.
Predictive models are constructed from data and leverage associations between predictor variables and the outcome of interest. These models are constructed by minimizing both estimation variance and bias, and because of this balance, predictive models, in the end, may trade off some accuracy for improved empirical precision (Shmueli 2010). In contrast to many explanatory models, the actual functional form of the predictive model is often not specified in advance as these models place much less emphasis on the value of individual predictor variables and much more emphasis on the overall prediction accuracy. In fact, most predictive models that are constructed using various machine learning methods produce no table of coefficient estimates or specific statistics that evaluate the significance of a given predictor variable. And because the focus of these models is on prediction, they must use variables that are available prior to observing the outcome of interest. Such variables are said to have ex-ante availability. In the case of responsive designs, where a prediction of nonresponse is desired in real time throughout the field period, the types of ex-ante variables may include auxiliary variables known for all sampling units or paradata that are collected on all sampled units during an initial field period. Certainly, these variables should be associated with survey response, but they may not provide a complete picture of why sampled persons or households participate in the survey or answer a given item. But the purpose and use of these models has less to do with fully explaining or confirming the causal mechanisms of nonresponse and more to do with correctly classifying sampled units as respondents or nonrespondents, and using this classification as the basis of tailoring or adjustment.
Evaluating Predictive Models Created Using Machine Learning Methods
Compared to traditional statistical methods, machine learning techniques are more prone to overfitting the data, that is, to detecting patterns that might not generalize to other data. Model development in machine learning hence usually relies on so-called cross-validation as one method to curb the risk of overfitting. Cross-validation can be implemented in different ways but the general idea is to use a subsample of the data, referred to as a training or estimation sample, to develop a predictive model. The remaining sample, not included in the training subsample, is referred to as a test or holdout sample and is used to evaluate the accuracy of the predictive model developed using the training sample. Some machine learning techniques use a third subsample for tuning purposes, that is, the validation sample, to find those tuning parameters that yield the most optimal prediction. In these cases, once a model has been constructed using the training sample and refined using the validation sample, its overall performance is then evaluated using the test sample. For supervised learners, these three samples contain both the predictor variables (or features) and the outcome (or target) of interest.
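The three-way split described above can be sketched as follows. The 70/15/15 proportions, the fixed seed, and the function name `split_three_ways` are assumptions chosen for illustration; the text does not prescribe specific fractions for this scheme.

```python
import random

def split_three_ways(cases, train_frac=0.70, valid_frac=0.15, seed=0):
    """Shuffle the cases and cut them into training, validation, and test samples."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is held out
    return train, valid, test

cases = list(range(100))  # stand-in for 100 records (features plus outcome)
train, valid, test = split_three_ways(cases)
```

The model would be fit on `train`, its tuning parameters chosen by comparing candidates on `valid`, and its final accuracy reported on `test`, which plays no part in either earlier step.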
The predictive accuracy of machine learning algorithms applied to continuous outcomes (e.g., regression problems) is usually quantified using a root mean squared error statistic that compares the observed value of the outcome to a predicted value. In classification problems, the predictive accuracy can be estimated using a host of statistics including sensitivity, specificity, and overall accuracy. Generally, the computation of these and related measures of accuracy is based on a confusion matrix, which is simply a cross-tabulated table with the rows denoting the actual value of the target variable for every case in the test set and the columns representing the predicted value of the target variable for every case in the test set. An example confusion matrix for a binary classification problem, displaying the counts of cases in each of its four cells, is shown in Table 1. The abbreviations in Table 1 represent: the number of true positives (TP), that is, the number of cases that were predicted to be a “Yes” for the binary target variable and that actually had that value; the number of false negatives (FN), that is, the number of cases that had an actual value of “Yes” for the target variable but were predicted to be a “No”; the number of false positives (FP), that is, the number of cases that had an actual value of “No” but were predicted to be a “Yes”; and finally, the number of true negatives (TN), that is, the number of cases that had an actual value of “No” and were predicted as such.
**Table 1** A typical confusion matrix for a binary classification problem displaying cell counts.
| Actual class | Predicted class: Yes (1) | Predicted class: No (0) |
|---|---|---|
| Yes (1) | TP | FN |
| No (0) | FP | TN |

TP = True positive; FN = False negative; FP = False positive; TN = True negative
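To make the four cells of Table 1 concrete, the sketch below tallies them from paired actual and predicted labels and then derives sensitivity, specificity, overall accuracy, and balanced accuracy using their standard definitions. The label vectors are made-up illustration data and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted):
    """Tally the four cells of a binary confusion matrix (1 = Yes, 0 = No)."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1  # actual Yes, predicted Yes
        elif a == 1 and p == 0:
            fn += 1  # actual Yes, predicted No
        elif a == 0 and p == 1:
            fp += 1  # actual No, predicted Yes
        else:
            tn += 1  # actual No, predicted No
    return tp, fn, fp, tn

# Made-up test-set labels, for illustration only.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)                 # share of actual Yes cases found
specificity = tn / (tn + fp)                 # share of actual No cases found
accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all cases classified correctly
balanced_accuracy = (sensitivity + specificity) / 2
```

Balanced accuracy averages the two class-specific rates, so it is less flattering than overall accuracy when one class dominates the test set.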
As mentioned earlier, there are a host of statistics that can be computed to estimate the accuracy of machine learning models applied to binary classification problems. Many of these statistics can be extended to the case of more than two levels in the target variable of interest. Since many of the survey related outcomes like survey response can be posed as a binary classification problem, we will illustrate these accuracy metrics using the confusion matrix that is given in Table 1. In Table 2, we define several common accuracy metrics for binary classification problems explicitly in terms of the cell counts displayed in Table 1. One additional metric that is not simply defined in terms of the cells of the confusion matrix is the area under the receiver operating characteristic (ROC) curve, abbreviated AUC. The ROC curve plots the true positive rate (sensitivity) versus the false positive rate (1-specificity) for various values of the cutoff used for creating the binary classifications. Values of the AUC statistic that are close to 0.5 indicate very poor fitting classification models, while values that are higher and closer to 1 indicate more accurate classification models. The technical interpretation of the AUC statistic is the probability that the classification model will rank a randomly chosen “Yes” case higher than a randomly chosen “No” case.
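The ranking interpretation of the AUC can be computed directly, as in the sketch below: compare every “Yes” case with every “No” case and count how often the “Yes” case receives the higher score. The scores are made-up values, the function name `auc_by_ranking` is hypothetical, and counting ties as half a win is a common convention assumed here rather than something stated in the text.

```python
def auc_by_ranking(actual, scores):
    """Fraction of (Yes, No) pairs in which the Yes case gets the higher score."""
    yes_scores = [s for a, s in zip(actual, scores) if a == 1]
    no_scores = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for y in yes_scores:
        for n in no_scores:
            if y > n:
                wins += 1.0
            elif y == n:
                wins += 0.5  # assumed convention: a tie counts as half a win
    return wins / (len(yes_scores) * len(no_scores))

# Made-up predicted probabilities, for illustration only.
actual = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = auc_by_ranking(actual, scores)  # 8 of the 9 pairs are ranked correctly
```

This pairwise loop is quadratic in the number of cases, so it is meant to illustrate the definition rather than to score large test sets.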
**Table 2** A battery of accuracy metrics for binary classification problems defined in terms of the cells of the confusion matrix displayed in Table 1.
Common Example Description
Within each of the four papers, we will apply the respective machine learning method to predict a simulated binary response outcome using several predictors using data from the 2012 US National Health Interview Survey (NHIS). Specifically, the demo data set (henceforth referred to as the DDS) consists of complete records from 26,785 adults aged 18+ that were extracted from the 2012 public use data file. More complete details about this specific data set have been described elsewhere (Buskirk and Kolenikov 2015), and a complete description of both the NHIS study and the entire corpus of survey data is available at: http://www.cdc.gov/nchs/nhis.htm
The primary application of each of the methods we discuss in the papers in this special edition will be to predict a binary survey response variable using a battery of demographic variables available in the DDS including: region, age, sex, education, race, income level, Hispanicity, employment status, ratio of family income to the poverty threshold and telephone status. The exact levels of these predictor variables are provided in Table 3. The binary survey response variable was randomly generated from a simulated probit model that was primarily a nonlinear function of these demographic variables. More specific information about the exact form of the simulated probit models and how the binary survey response was randomly generated for each adult in the DDS are provided in the online technical appendix.
**Table 3** Predictor variables used for generating the survey response outcome and modeling it using various machine learning methods.
To evaluate model performance, we used a split sample cross-validation approach that created a single training data set (trainDDS) consisting of a random subset of approximately 85% of the cases in DDS along with a test data set (testDDS) consisting of the remaining cases. Each of the methods described in this special issue was applied to predict the simulated survey binary response variable using the core set of aforementioned demographic variables. Specifically, models were estimated using data from all cases in the trainDDS. In turn, these estimated models were then applied to the testDDS. The performance of each of the methods was measured by how well the estimated models predicted survey response status for cases in testDDS using the following accuracy metrics: percent correctly classified, sensitivity, specificity, balanced accuracy (average of sensitivity and specificity), and the AUC.
In a world where nearly all manual tasks are being automated, the definition of manual is changing. There are now many different types of Machine Learning algorithms, some of which can help computers play chess, perform surgeries, and get smarter and more personal.
We are living in an era of constant technological progress, and looking at how computing has advanced over the years, we can predict what’s to come in the days ahead.
One of the main features of this revolution that stands out is how computing tools and techniques have been democratized. Data scientists have built sophisticated data-crunching machines in the last 5 years by seamlessly executing advanced techniques. The results have been astounding.
Many different types of machine learning algorithms have been designed in these dynamic times to help solve complex real-world problems. ML algorithms are automated and self-modifying so that they continue to improve over time. Before we delve into the top 10 machine learning algorithms you should know, let’s take a look at the different types of machine learning algorithms and how they are classified.
Machine learning algorithms are classified into 4 types:
Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement Learning
However, these four types of ML algorithms are further classified into more types.
Below is the list of Top 10 commonly used Machine Learning (ML) Algorithms:
Linear regression
Logistic regression
Decision tree
SVM algorithm
Naive Bayes algorithm
KNN algorithm
K-means
Random forest algorithm
Dimensionality reduction algorithms
Gradient boosting algorithm and AdaBoosting algorithm
How Learning These Vital Algorithms Can Enhance Your Skills in Machine Learning
If you’re a data scientist or a machine learning enthusiast, you can use these techniques to create functional Machine Learning projects.
The most popular machine learning algorithms fall into three types: supervised learning, unsupervised learning, and reinforcement learning. All three are represented in this list of 10 common machine learning algorithms:
List of Popular Machine Learning Algorithms
1. Linear Regression
To understand how Linear Regression works, imagine how you would arrange random logs of wood in increasing order of their weight. There is a catch, however: you cannot weigh each log. You have to guess its weight just by looking at the height and girth of the log (visual analysis) and arrange the logs using a combination of these visible parameters. This is what linear regression in machine learning is like.
In this process, a relationship is established between independent and dependent variables by fitting them to a line. This line is known as the regression line and is represented by the linear equation Y = a*X + b.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
The coefficients a and b are derived by minimizing the sum of the squared vertical distances (residuals) between the data points and the regression line.
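For a single independent variable, the least-squares values of a and b have a well-known closed form, sketched below. The data points are invented so that they lie exactly on the line Y = 2*X + 1, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for Y = a*X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b

# Invented data lying exactly on Y = 2*X + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)  # a is 2.0 and b is 1.0 for this data
```

With noisy data the same formulas return the line whose squared residuals sum to the smallest possible value.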
2. Logistic Regression
Logistic Regression is used to estimate discrete values (usually binary values like 0/1) from a set of independent variables. It helps predict the probability of an event by fitting data to a logistic function, which is the inverse of the logit. It is also called logit regression.
The methods listed below are often used to help improve logistic regression models:
include interaction terms
eliminate features
use regularization techniques
use a non-linear model
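A minimal sketch of fitting a one-feature logistic regression is shown below. It uses plain gradient descent on the average log-loss; the learning rate, number of epochs, data, and function names are assumptions for illustration, and production libraries typically use more refined optimizers.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for each case under the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Invented data: small x values are labelled 0, large x values are labelled 1.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(xs, ys)

p_low = sigmoid(w * 1.0 + b)   # estimated probability of class 1 at x = 1.0
p_high = sigmoid(w * 4.0 + b)  # estimated probability of class 1 at x = 4.0
```

Thresholding the predicted probability at 0.5 turns the model into a binary classifier.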
3. Decision Tree
The Decision Tree algorithm is one of the most popular machine learning algorithms in use today. It is a supervised learning algorithm that is used for classification problems, and it works well with both categorical and continuous dependent variables. The algorithm divides the population into two or more homogeneous sets based on the most significant attributes (independent variables).
4. SVM (Support Vector Machine) Algorithm
The SVM algorithm is a classification method in which you plot raw data as points in an n-dimensional space (where n is the number of features you have). The value of each feature is then tied to a particular coordinate, making it easy to classify the data. Lines (or, in higher dimensions, hyperplanes) called classifiers can then be used to split the data into classes.
5. Naive Bayes Algorithm
A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
Even if these features are related to each other, a Naive Bayes classifier would consider all of these properties independently when calculating the probability of a particular outcome.
A Naive Bayesian model is easy to build and useful for massive datasets. Despite its simplicity, it can sometimes perform competitively with far more sophisticated classification methods.
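The independence assumption means the classifier can score a class by adding the class log prior to one log-likelihood term per word. The sketch below does this for a tiny sentiment example, with add-one (Laplacian) smoothing so that words unseen in a class do not produce a zero probability. The training sentences and function names are invented for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {word for doc in docs for word in doc.split()}
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(class_counts[c] / len(docs)) for c in classes}
    log_likelihood = {}
    for c in classes:
        total = sum(word_counts[c].values())
        # Add 1 to every count; add the vocabulary size to the denominator.
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + 1) / (total + len(vocab)))
            for word in vocab
        }
    return log_prior, log_likelihood

def predict(text, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][word]
            for word in text.split()
            if word in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Invented training data, for illustration only.
docs = ["great happy great", "happy good day", "sad awful day", "awful bad sad"]
labels = ["pos", "pos", "neg", "neg"]
log_prior, log_likelihood = train_naive_bayes(docs, labels)

label_a = predict("happy great day", log_prior, log_likelihood)
label_b = predict("sad bad day", log_prior, log_likelihood)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.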
6. KNN (K- Nearest Neighbors) Algorithm
This algorithm can be applied to both classification and regression problems, although within the data science industry it is more widely used to solve classification problems. It’s a simple algorithm that stores all available cases and classifies any new case by taking a majority vote of its k nearest neighbors. The case is then assigned to the class with which it has the most in common. A distance function performs this measurement.
KNN can be easily understood by comparing it to real life. For example, if you want information about a person, it makes sense to talk to his or her friends and colleagues!
Things to consider before selecting K Nearest Neighbours Algorithm:
KNN is computationally expensive
Variables should be normalized, or else higher range variables can bias the algorithm
Data still needs to be pre-processed.
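The considerations above can be sketched in a few lines: features are first rescaled to a common range so that the large-valued one does not dominate the distance, and a new case is then labelled by majority vote among its k nearest neighbors. The age and income data, the choice of min-max scaling, and the helper names are illustrative assumptions.

```python
import math
from collections import Counter

def fit_min_max(rows):
    """Return a function that rescales each feature to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]

    def scale(row):
        return [
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
    return scale

def knn_classify(query, points, labels, k=3):
    """Assign the majority label among the k points closest to the query."""
    nearest = sorted(
        (math.dist(query, point), label) for point, label in zip(points, labels)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented (age, income) cases with two classes, for illustration only.
rows = [[25, 30000], [30, 35000], [28, 32000], [50, 90000], [55, 95000], [52, 88000]]
labels = ["A", "A", "A", "B", "B", "B"]

scale = fit_min_max(rows)
points = [scale(row) for row in rows]
young_low = knn_classify(scale([29, 33000]), points, labels)
older_high = knn_classify(scale([53, 91000]), points, labels)
```

Every prediction scans all stored cases, which is why KNN becomes computationally expensive as the dataset grows.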
7. K-Means
K-means is an unsupervised learning algorithm that solves clustering problems. Data sets are partitioned into a particular number of clusters (let’s call that number K) in such a way that the data points within a cluster are homogeneous, and heterogeneous with respect to the data points in other clusters.
How K-means forms clusters:
The K-means algorithm picks K points, called centroids, one for each cluster.
Each data point joins the cluster of its closest centroid, yielding K clusters.
It now creates new centroids based on the existing cluster members.
With these new centroids, the closest distance for each data point is determined. This process is repeated until the centroids do not change.
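The steps above can be sketched as a short loop. For reproducibility this sketch starts from two hand-picked centroids, whereas implementations usually pick the starting points at random; the data points and the function name `k_means` are invented for illustration.

```python
import math

def k_means(points, centroids, max_iterations=100):
    """Alternate between assigning points and recomputing centroids until stable."""
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)

        # Update step: each centroid moves to the mean of its cluster members.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid  # an empty cluster keeps its old centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # centroids did not change: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Invented 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
initial = [points[0], points[3]]  # assumed starting centroids, chosen by hand
centroids, clusters = k_means(points, initial)
```

K itself is the tuning parameter here: it is fixed by the number of starting centroids before the loop runs.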
8. Random Forest Algorithm
A collection of decision trees is called a Random Forest. To classify a new object based on its attributes, each tree produces a classification, and the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
Each tree is planted & grown as follows:
If the number of cases in the training set is N, then a sample of N cases is taken at random, with replacement. This sample will be the training set for growing the tree.
If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M, and the best split on this m is used to split the node. The value of m is held constant during this process.
Each tree is grown to the largest extent possible. There is no pruning.
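The sampling and voting mechanics can be sketched as below. To keep the example short, each "tree" is a one-split decision stump, which is a deliberate simplification of the fully grown trees described above; the data, the stump learner, and all function names are assumptions for illustration.

```python
import random
from collections import Counter

def best_stump(rows, labels, feature_indices):
    """Among the given features, find the single split with the fewest errors."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [l for row, l in zip(rows, labels) if row[f] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(l != left_label for l in left)
            errors += sum(l != right_label for l in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]

def grow_forest(rows, labels, n_trees=25, m=1, seed=0):
    """Grow stumps, each on a bootstrap sample and a random subset of m features."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # N cases, with replacement
        features = rng.sample(range(n_features), m)    # m of the M input variables
        forest.append(best_stump(
            [rows[i] for i in sample], [labels[i] for i in sample], features
        ))
    return forest

def forest_predict(forest, row):
    """Each stump votes for a class; the forest returns the most common vote."""
    votes = Counter(
        left if row[f] <= threshold else right
        for f, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]

# Invented two-feature data with two well-separated classes.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
forest = grow_forest(rows, labels)
vote_low = forest_predict(forest, (1, 1))
vote_high = forest_predict(forest, (9, 9))
```

Individual stumps trained on an unlucky bootstrap sample may be wrong, but the majority vote smooths those errors out.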
9. Dimensionality Reduction Algorithms
In today’s world, vast amounts of data are being stored and analyzed by corporates, government agencies, and research organizations. As a data scientist, you know that this raw data contains a lot of information – the challenge is to identify significant patterns and variables.
Dimensionality reduction and feature selection techniques such as Factor Analysis and Missing Value Ratio, together with variable importance measures from Decision Trees and Random Forests, can help you find the relevant details.
10. Gradient Boosting Algorithm and AdaBoosting Algorithm
Gradient Boosting and AdaBoost are boosting algorithms used when massive loads of data have to be handled to make predictions with high accuracy. Boosting is an ensemble learning technique that combines the predictive power of several base estimators to improve robustness.
In short, it combines multiple weak or average predictors to build a strong predictor. These boosting algorithms often work well in data science competitions like Kaggle, AV Hackathon, and CrowdAnalytix, and they are among the most preferred machine learning algorithms today. Use them, along with Python and R code, to achieve accurate outcomes.
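The idea of combining weak predictors into a strong one can be sketched with AdaBoost over one-split decision stumps. This covers AdaBoost only, not gradient boosting, and the ten-point data set, the stump learner, and the function names are assumptions for illustration. No single stump classifies this data perfectly, but three weighted stumps together do.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict `polarity` below the threshold and the opposite sign above it."""
    return polarity if x < threshold else -polarity

def adaboost(xs, ys, rounds=3):
    """Fit `rounds` weighted stumps; labels in ys must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n  # start with equal weight on every case
    ordered = sorted(xs)
    thresholds = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    ensemble = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for t in thresholds:
            for polarity in (1, -1):
                error = sum(
                    w for x, y, w in zip(xs, ys, weights)
                    if stump_predict(x, t, polarity) != y
                )
                if best is None or error < best[0]:
                    best = (error, t, polarity)
        error, t, polarity = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this stump's vote weight
        ensemble.append((alpha, t, polarity))
        # Increase the weight of misclassified cases, then renormalize.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, t, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Sign of the alpha-weighted sum of the stump predictions."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Invented 1-D data that no single threshold can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [ensemble_predict(ensemble, x) for x in xs]
```

Each round concentrates on the cases the previous stumps got wrong, which is what lets a handful of weak rules cover a pattern none of them captures alone.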
Conclusion
If you want to build a career in machine learning, start right away. The field is growing, and the sooner you understand the scope of machine learning tools, the sooner you’ll be able to provide solutions to complex work problems. However, if you are experienced in the field and want to boost your career, you can take up the Post Graduate Program in AI and Machine Learning, offered in partnership with Purdue University and in collaboration with IBM. This program gives you in-depth knowledge of Python, deep learning with TensorFlow, Natural Language Processing, Speech Recognition, Computer Vision, and Reinforcement Learning.
Also, prepare yourself for Machine Learning interview questions to land your dream job!
A machine learning model is a program that can find patterns or make decisions from a previously unseen dataset. For example, in natural language processing, machine learning models can parse and correctly recognize the intent behind previously unheard sentences or combinations of words. In image recognition, a machine learning model can be taught to recognize objects – such as cars or dogs. A machine learning model can perform such tasks by having it ‘trained’ with a large dataset. During training, the machine learning algorithm is optimized to find certain patterns or outputs from the dataset, depending on the task. The output of this process – often a computer program with specific rules and data structures – is called a machine learning model.
What is a machine learning Algorithm?
A machine learning algorithm is a mathematical method to find patterns in a set of data. Machine Learning algorithms are often drawn from statistics, calculus, and linear algebra. Some popular examples of machine learning algorithms include linear regression, decision trees, random forest, and XGBoost.
What is Model Training in machine learning?
The process of running a machine learning algorithm on a dataset (called training data) and optimizing the algorithm to find certain patterns or outputs is called model training. The resulting function with rules and data structures is called the trained machine learning model.
What are the different types of Machine Learning?
In general, most machine learning techniques can be classified into supervised learning, unsupervised learning, and reinforcement learning.
What is Supervised Machine Learning?
In supervised machine learning, the algorithm is provided an input dataset, and is rewarded or optimized to meet a set of specific outputs. For example, supervised machine learning is widely deployed in image recognition, utilizing a technique called classification. Supervised machine learning is also used in predicting demographics such as population growth or health metrics, utilizing a technique called regression.
What is Unsupervised Machine Learning?
In unsupervised machine learning, the algorithm is provided an input dataset, but not rewarded or optimized to specific outputs, and instead trained to group objects by common characteristics. For example, recommendation engines on online stores rely on unsupervised machine learning, specifically a technique called clustering.
What is Reinforcement Learning?
In reinforcement learning, the algorithm is made to train itself using many trial and error experiments. Reinforcement learning happens when the algorithm interacts continually with the environment, rather than relying on training data. One of the most popular examples of reinforcement learning is autonomous driving.
What are the different machine learning models?
There are many machine learning models, and almost all of them are based on certain machine learning algorithms. Popular classification and regression algorithms fall under supervised machine learning, and clustering algorithms are generally deployed in unsupervised machine learning scenarios.
Supervised Machine Learning
Logistic Regression: Logistic Regression is used to determine if an input belongs to a certain group or not
SVM: Support Vector Machines (SVM) represent each object as a point in an n-dimensional space and use a hyperplane to separate objects into groups by common features
Naive Bayes: Naive Bayes is an algorithm that assumes independence among variables and uses probability to classify objects based on features
Decision Trees: Decision trees are also classifiers that are used to determine what category an input falls into by traversing the nodes and leaves of a tree
Linear Regression: Linear regression is used to identify relationships between the variable of interest and the inputs, and predict its values based on the values of the input variables.
kNN: The k Nearest Neighbors technique involves grouping the closest objects in a dataset and finding the most frequent or average characteristics among the objects.
Random Forest: Random forest is a collection of many decision trees from random subsets of the data, resulting in a combination of trees that may be more accurate in prediction than a single decision tree.
Boosting algorithms: Boosting algorithms, such as Gradient Boosting Machine, XGBoost, and LightGBM, use ensemble learning. They combine the predictions from multiple algorithms (such as decision trees) while taking into account the error from the previous algorithm.
Unsupervised Machine Learning
K-Means: The K-Means algorithm finds similarities between objects and groups them into K different clusters.
Hierarchical Clustering: Hierarchical clustering builds a tree of nested clusters without having to specify the number of clusters.
What is a Decision Tree in Machine Learning (ML)?
A Decision Tree is a predictive approach in ML to determine what class an object belongs to. As the name suggests, a decision tree is a tree-like flow chart where the class of an object is determined step-by-step using certain known conditions.
A decision tree visualized in the Databricks
What is Regression in Machine Learning?
Regression in data science and machine learning is a statistical method that enables predicting outcomes based on a set of input variables. The outcome is often a variable that depends on a combination of the input variables.
A linear regression model performed on the Databricks
What is a Classifier in Machine Learning?
A classifier is a machine learning algorithm that assigns an object as a member of a category or group. For example, classifiers are used to detect if an email is spam, or if a transaction is fraudulent.
How many models are there in machine learning?
Many! Machine learning is an evolving field and there are always more machine learning models being developed.
What is the best model for machine learning?
The machine learning model most suited for a specific situation depends on the desired outcome. For example, to predict the number of vehicle purchases in a city from historical data, a supervised learning technique such as linear regression might be most useful. On the other hand, to identify if a potential customer in that city would purchase a vehicle, given their income and commuting history, a decision tree might work best.
What is model deployment in Machine Learning (ML)?
Model deployment is the process of making a machine learning model available for use on a target environment—for testing or production. The model is usually integrated with other applications in the environment (such as databases and UI) through APIs. Deployment is the stage after which an organization can actually make a return on the heavy investment made in model development.
A full machine learning model lifecycle on the Databricks Lakehouse.
What are Deep Learning Models?
Deep learning models are a class of ML models that imitate the way humans process information. The model consists of several layers of processing (hence the term ‘deep’) to extract high-level features from the data provided. Each processing layer passes on a more abstract representation of the data to the next layer, with the final layer providing a more human-like insight. Unlike traditional ML models, which typically rely on structured data and hand-crafted features, deep learning models can ingest large amounts of unstructured data. They are used to perform more human-like functions such as facial recognition and natural language processing.
A simplified representation of deep learning. Source: https://www.databricks.com/discover/pages/the-democratization-of-artificial-intelligence-and-deep-learning
What is Time Series Machine Learning?
A time-series machine learning model is one in which one of the independent variables is a successive length of time (minutes, days, years, etc.) and has a bearing on the dependent or predicted variable. Time series machine learning models are used to predict time-bound events, for example the weather in a future week, the expected number of customers in a future month, revenue guidance for a future year, and so on.
Google’s self-driving cars and robots get a lot of press, but the company’s real future is in machine learning, the technology that enables computers to get smarter and more personal. – Eric Schmidt (Google Chairman)
We are probably living in the most defining period of human history. The period when computing moved from large mainframes to PCs to the cloud. But what makes it defining is not what has happened but what is coming our way in years to come. What makes this period exciting and enthralling for someone like me is the democratization of the various tools, techniques, and machine learning algorithms that followed the boost in computing. Welcome to the world of data science!
Today, as a data scientist, I can build data-crunching machines with complex algorithms for a few dollars per hour. But reaching here wasn’t easy! I had my dark days and nights.
Learning Objectives
Major focus on commonly used machine learning techniques and algorithms.
Algorithms covered – Linear regression, logistic regression, Naive Bayes, kNN, Random forest, etc.
Learn both theory and implementation of the machine learning algorithms in R and python.
Table of Contents
Who Can Benefit the Most From This Guide?
3 Types Of Machine Learning Algorithms
List of Common Machine Learning Algorithms
Gradient Boosting Algorithms
Practice Problems
Conclusion
Who Can Benefit the Most From This Guide?
What I am giving out today is probably the most valuable guide I have ever created.
The idea behind creating this guide is to simplify the journey of aspiring data scientists and machine learning (which is part of artificial intelligence) enthusiasts across the world. Through this guide, I will enable you to work on machine-learning problems and gain from experience. I am providing a high-level understanding of various machine learning algorithms along with R & Python codes to run them. These should be sufficient to get your hands dirty. You can also check out our Machine Learning Course.
Essentials of machine learning algorithms with implementation in R and Python
I have deliberately skipped the statistics behind these techniques and artificial neural networks, as you don’t need to understand them initially. So, if you are looking for a statistical understanding of these algorithms, you should look elsewhere. But, if you want to equip yourself to start building a machine learning project, you are in for a treat.
3 Types of Machine Learning Algorithms
Supervised Learning Algorithms
How it works: This algorithm consists of a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps input data to desired outputs. The training process continues until the model achieves the desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.
Unsupervised Learning Algorithms
How it works: In this type of algorithm, we do not have any target or outcome variable to predict or estimate (the data is unlabelled). It is used for recommendation systems or for clustering a population into different groups. Clustering algorithms are widely used for segmenting customers into different groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means clustering.
Reinforcement Learning Algorithms
How it works: Using this algorithm, the machine is trained to make specific decisions. The machine is exposed to an environment where it trains itself continually using trial and error. This machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning: Markov Decision Process
List of Top 10 Common Machine Learning Algorithms
Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:
Linear Regression
Logistic Regression
Decision Tree
SVM
Naive Bayes
kNN
K-Means
Random Forest
Dimensionality Reduction Algorithms
Gradient Boosting algorithms
GBM
XGBoost
LightGBM
CatBoost
Linear Regression
It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on a continuous variable(s). Here, we establish the relationship between independent and dependent variables by fitting the best line. This best-fit line is known as the regression line and is represented by a linear equation Y= a*X + b.
The best way to understand linear regression is to relive this experience of childhood. Let us say you ask a child in fifth grade to arrange people in his class by increasing the order of weight without asking them their weights! What do you think the child will do? He/she would likely look (visually analyze) at the height and build of people and arrange them using a combination of these visible parameters. This is linear regression in real life! The child has actually figured out that height and build would be correlated to weight by a relationship, which looks like the equation above.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
The coefficients a and b are derived by minimizing the sum of the squared vertical distances between the data points and the regression line.
Look at the below example. Here we have identified the best-fit line having linear equation y=0.2811x+13.9. Now using this equation, we can find the weight, knowing the height of a person.
Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable, and Multiple Linear Regression (as the name suggests) is characterized by multiple (more than 1) independent variables. While finding the best-fit line, you can also fit a curve instead of a straight line; this is known as polynomial or curvilinear regression.
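As a minimal sketch of how the slope a and intercept b can be obtained for simple linear regression, the snippet below uses the closed-form least-squares solution in plain Python. The function name fit_line and the sample data are hypothetical and chosen only for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line Y = a*X + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx            # slope
    b = y_bar - a * x_bar    # intercept
    return a, b

# Hypothetical data lying exactly on Y = 2*X + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)
```

With more than one independent variable the same idea applies, but the coefficients are usually found with matrix methods or a library rather than by hand.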
R Code:
# Load the train and test datasets
# Identify the feature and response variable(s); values must be numeric
# (x_train and x_test are assumed to be data frames with the same columns)
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)
# Train the model using the training set and inspect the fit
linear <- lm(y_train ~ ., data = x)
summary(linear)
# Predict output
predicted <- predict(linear, x_test)
Logistic Regression
Don’t get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of the occurrence of an event by fitting data to a logistic function. Hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1.
Again, let us try and understand this through a simple example.
Let’s say your friend gives you a puzzle to solve. There are only 2 outcome scenarios: either you solve it, or you don’t. Now imagine that you are being given a wide range of puzzles/quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this: if you are given a trigonometry-based tenth-grade problem, you are 70% likely to solve it. On the other hand, if it is a fifth-grade history question, the probability of getting an answer is only 30%. This is what Logistic Regression provides you.
Coming to the math, the log odds of the outcome are modeled as a linear combination of the predictor variables.
odds = p/(1-p) = probability of the event occurring / probability of the event not occurring
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3....+bkXk
Above, p is the probability of the presence of the characteristic of interest. The model chooses parameters that maximize the likelihood of observing the sample values rather than parameters that minimize the sum of squared errors (as in ordinary regression).
Now, you may ask, why take a log? For the sake of simplicity, let’s just say that this is one of the best mathematical ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this article.
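To make the relationship between the log odds and the predicted probability concrete, here is a small sketch in plain Python. The coefficient values are made up for illustration, and the helper names (logit, sigmoid, predict_probability) are hypothetical rather than taken from any library.

```python
import math

def logit(p):
    """Log odds of a probability p, i.e. ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps any real number to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_probability(coefs, intercept, features):
    """Compute b0 + b1*X1 + ... + bk*Xk and convert it to a probability."""
    z = intercept + sum(b * x for b, x in zip(coefs, features))
    return sigmoid(z)

# Hypothetical fitted coefficients b1, b2 and intercept b0
p = predict_probability([0.8, -0.5], 0.1, [2.0, 1.0])
print(round(p, 4), round(logit(p), 4))
```

Applying logit to the returned probability recovers the linear combination, which is exactly what the equation logit(p) = b0 + b1X1 + ... + bkXk states.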
R Code:
x <- cbind(x_train,y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x,family='binomial')
summary(logistic)
# Predict output as probabilities between 0 and 1
predicted <- predict(logistic, x_test, type = "response")
Furthermore…
There are many different steps that could be tried in order to improve the model:
including interaction terms
removing features
regularization techniques
using a non-linear model
Decision Tree
This is one of my favorite algorithms, and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes/ independent variables to make as distinct groups as possible. For more details, you can read Decision Tree Simplified.
Source: statsexchange
In the image above, you can see that population is classified into four different groups based on multiple attributes to identify ‘if they will play or not’. To split the population into different heterogeneous groups, it uses various techniques like Gini, Information Gain, Chi-square, and entropy.
The best way to understand how the decision tree works, is to play Jezzball – a classic game from Microsoft (image below). Essentially, you have a room with moving walls and you need to create walls such that the maximum area gets cleared off without the balls.
So, every time you split the room with a wall, you are trying to create 2 different populations within the same room. Decision trees work in a very similar fashion by dividing a population into as different groups as possible.
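As a rough illustration of how a split can be scored, the sketch below computes the Gini impurity mentioned above for a parent group and for two candidate splits. The labels and the function names are hypothetical; real implementations also search over features and thresholds.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, larger for more mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: size-weighted average of the two child groups."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical population: 5 people who play and 5 who stay home
parent = ["play"] * 5 + ["stay"] * 5
pure_split = weighted_gini(["play"] * 5, ["stay"] * 5)
mixed_split = weighted_gini(
    ["play", "play", "stay"],
    ["play"] * 3 + ["stay"] * 4,
)
print(gini(parent), pure_split, mixed_split)
```

A tree-growing algorithm would prefer the split with the lower weighted impurity, since its child groups are more homogeneous.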
More: Simplified Version of Decision Tree Algorithms
R Code:
library(rpart)
x <- cbind(x_train,y_train)
# grow tree
fit <- rpart(y_train ~ ., data = x,method="class")
summary(fit)
# Predict output as class labels
predicted <- predict(fit, x_test, type = "class")
SVM (Support Vector Machine)
It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
For example, if we only had two features like the Height and Hair length of an individual, we’d first plot these two variables in two-dimensional space where each point has two coordinates. (The points that lie closest to the separating boundary are known as Support Vectors.)
Now, we will find a line that splits the data between the two differently classified groups. This will be the line such that its distance from the closest point in each of the two groups is as large as possible. If there are more variables, a hyperplane is used to separate the classes.
In the example shown above, the line which splits the data into two differently classified groups is the black line, since the two closest points are the farthest from the line. This line is our classifier. Then, depending on which side of the line the testing data lands, we assign the new data to the corresponding class.
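The idea of preferring the line with the largest distance to the closest points can be sketched as follows. This is not an SVM solver: it only compares two hand-picked candidate lines on made-up data, and all names and values are illustrative assumptions.

```python
import math

def signed_distance(w, b, point):
    """Signed distance from a point to the line/hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

def margin(w, b, points):
    """Distance from the line to the closest point."""
    return min(abs(signed_distance(w, b, p)) for p in points)

def classify(w, b, point):
    """Assign a class according to the side of the line the point falls on."""
    return 1 if signed_distance(w, b, point) >= 0 else -1

# Hypothetical two-feature data for two classes
class_a = [(1.0, 1.0), (2.0, 1.0)]
class_b = [(4.0, 4.0), (5.0, 3.0)]
points = class_a + class_b

line_1 = ([1.0, 1.0], -5.0)   # x + y - 5 = 0
line_2 = ([1.0, 0.0], -3.0)   # x - 3 = 0
print(margin(*line_1, points), margin(*line_2, points))
```

Both lines separate the two classes, but the first leaves a wider gap to the nearest points, which is the property an SVM optimizes for.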
More: Simplified Version of Support Vector Machine
Think of this algorithm as playing JezzBall in n-dimensional space. The tweaks in the game are:
You can draw lines/planes at any angle (rather than just horizontal or vertical as in the classic game)
The objective of the game is to segregate balls of different colors in different rooms.
And the balls are not moving.
R Code:
library(e1071)
x <- cbind(x_train,y_train)
# Fitting model
fit <-svm(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted= predict(fit,x_test)
Naive Bayes
It is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.
The Naive Bayesian model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can sometimes outperform much more sophisticated classification methods.
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). Look at the equation below:
P(c|x) = P(x|c) * P(c) / P(x)
Here,
P(c|x) is the posterior probability of class (target) given predictor (attribute).
P(c) is the prior probability of the class.
P(x|c) is the likelihood which is the probability of the predictor given the class.
P(x) is the prior probability of the predictor.
Example: Let’s understand it using an example. Below is a training data set of weather and the corresponding target variable, ‘Play.’ Now, we need to classify whether players will play or not based on weather conditions. Let’s follow the below steps to perform it.
Step 1: Convert the data set to a frequency table.
Step 2: Create a Likelihood table by finding the probabilities like Overcast probability = 0.29 and probability of playing is 0.64.
Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using above discussed method, so P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)
Here we have P (Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes)= 9/14 = 0.64
Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is higher than P (No | Sunny) = 1 - 0.60 = 0.40, so the prediction is that players will play.
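The arithmetic in this example can be checked with exact fractions. The snippet below only reuses the counts quoted above (3 of 9, 9 of 14, 5 of 14); the variable names are illustrative.

```python
from fractions import Fraction

# Values taken from the worked example above
p_sunny_given_yes = Fraction(3, 9)
p_yes = Fraction(9, 14)
p_sunny = Fraction(5, 14)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny

print(float(p_yes_given_sunny), float(p_no_given_sunny))
```

Working with fractions avoids the small rounding differences that appear when the intermediate values are rounded to two decimals.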
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
R Code:
library(e1071)
x <- cbind(x_train,y_train)
# Fitting model
fit <-naiveBayes(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted= predict(fit,x_test)
kNN (k- Nearest Neighbors)
It can be used for both classification and regression problems. However, it is more widely used in classification problems in the industry. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k neighbors. A case is assigned to the class that is most common among its K nearest neighbors, as measured by a distance function.
These distance functions can be Euclidean, Manhattan, Minkowski, and Hamming distances. The first three are used for continuous variables, and the fourth one (Hamming) for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge while performing kNN modeling.
More: Introduction to k-nearest neighbors: Simplified.
KNN can easily be mapped to our real lives. If you want to learn about a person about whom you have no information, you might find out about their close friends and the circles they move in to get a sense of who they are!
Things to consider before selecting kNN:
KNN is computationally expensive
Variables should be normalized else higher range variables can bias it
More work is needed at the pre-processing stage, such as outlier and noise removal, before applying kNN
Python Code:
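A minimal sketch of kNN classification in plain Python, assuming numeric features and a small in-memory training set. The function names and the sample points are hypothetical, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

def manhattan(p, q):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def knn_predict(train_points, train_labels, query, k=3, distance=math.dist):
    """Classify query by a majority vote of its k nearest training points."""
    # Sort training cases by their distance to the query point
    by_distance = sorted(
        (distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: two groups of points labelled "a" and "b"
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.0), (6.0, 7.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2.0, 2.0), k=3))
print(knn_predict(points, labels, (6.0, 5.0), k=3, distance=manhattan))
```

Because every prediction scans the whole training set, the cost grows with the data size, which is the computational expense noted in the list above.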
R Code:
library(class)
# Fitting model and predicting in one step: knn() from the class package
# takes the training features, the test features, and the training labels
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)
K-Means
It is a type of unsupervised algorithm which solves the clustering problem. Its procedure follows a simple and easy way to partition a given data set into a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to the other clusters.
Remember figuring out shapes from ink blots? k means is somewhat similar to this activity. You look at the shape and spread to decipher how many different clusters/populations are present!
How K-means forms cluster:
1. K-means picks k points, known as centroids, one for each cluster.
2. Each data point is assigned to the closest centroid, forming k clusters.
3. The centroid of each cluster is recomputed from its current members, giving new centroids.
4. With the new centroids, steps 2 and 3 are repeated: each data point is reassigned to the closest new centroid. This process repeats until convergence, i.e., until the centroids no longer change.
How to determine the value of K:
In K-means, we have clusters, and each cluster has its own centroid. The sum of the squared differences between the centroid and the data points within a cluster constitutes the within-cluster sum of squares for that cluster. When the sum-of-squares values for all the clusters are added, the result is the total within-cluster sum of squares for the cluster solution.
We know that as the number of clusters increases, this value keeps decreasing, but if you plot the result, you may see that the sum of squared distances decreases sharply up to some value of k and then much more slowly after that. That bend (the “elbow”) suggests the optimum number of clusters.
Python Code:
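The procedure described above can be sketched in plain Python as follows. This is a simplified illustration, assuming points are tuples of numbers and that initial centroids are sampled from the data with a fixed seed; the function names are hypothetical.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k starting centroids
    for _ in range(max_iter):
        # Assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids unchanged
            break
        centroids = new_centroids
    return centroids, clusters

def total_within_ss(centroids, clusters):
    """Total within-cluster sum of squared distances."""
    return sum(
        math.dist(p, c) ** 2
        for c, cluster in zip(centroids, clusters)
        for p in cluster
    )

# Hypothetical data: two well-separated groups of three points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for k in (1, 2):
    centroids, clusters = kmeans(points, k)
    print(k, round(total_within_ss(centroids, clusters), 3))
```

Printing the total within-cluster sum of squares for increasing k is one way to produce the curve used to look for the point where the decrease slows down.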
R Code:
library(cluster)
fit <- kmeans(X, 3) # 3 cluster solution
Random Forest
Random Forest is a trademarked term for an ensemble of decision trees. In Random Forest, we’ve got a collection of decision trees (also known as “Forest”). To classify a new object based on attributes, each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
Each tree is planted & grown as follows:
If the number of cases in the training set is N, then a sample of N cases is taken at random but with replacement. This sample will be the training set for growing the tree.
If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M, and the best split on this m is used to split the node. The value of m is held constant during the forest growth.
Each tree is grown to the largest extent possible. There is no pruning.
For more details on this algorithm, compared with the decision tree and tuning model parameters, I would suggest you read these articles:
Introduction to Random forest – Simplified
Comparing a CART model to Random Forest (Part 1)
Comparing a Random Forest to a CART model (Part 2)
Tuning the parameters of your Random Forest model
Python Code:
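The snippet below is not a full random forest. It is a small sketch of three ingredients described above (sampling N cases with replacement, choosing m of the M variables at random, and majority voting), written in plain Python with hypothetical function names and toy inputs.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) cases at random with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices out of M = n_features for one node."""
    return rng.sample(range(n_features), m)

def majority_vote(tree_predictions):
    """Return the class that received the most votes across the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)           # fixed seed so the sketch is repeatable
rows = list(range(10))           # stand-in for 10 training cases
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(8, 3, rng)
print(sample, features, majority_vote(["yes", "no", "yes"]))
```

In a real forest, each tree would be grown on its own bootstrap sample, with a fresh feature subset drawn at every node.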
R Code:
library(randomForest)
x <- cbind(x_train,y_train)
# Fitting model (y_train should be a factor for classification)
fit <- randomForest(y_train ~ ., data = x, ntree=500)
summary(fit)
#Predict Output
predicted= predict(fit,x_test)
Dimensionality Reduction Algorithms
In the last 4-5 years, there has been an exponential increase in data capturing at every possible stage. Corporates/ Government Agencies/ Research organizations are not only coming up with new sources, but also they are capturing data in great detail.
For example, E-commerce companies are capturing more details about customers like their demographics, web crawling history, what they like or dislike, purchase history, feedback, and many others to give them personalized attention more than your nearest grocery shopkeeper.
As data scientists, the data we are offered also consists of many features. This sounds good for building a robust model, but there is a challenge: how would you identify the highly significant variable(s) out of 1000 or 2000? In such cases, dimensionality reduction algorithms help us, along with various other techniques such as Decision Tree, Random Forest, PCA (principal component analysis), Factor Analysis, identifying variables based on the correlation matrix, the missing value ratio, and others.
To know more about these algorithms, you can read “Beginners Guide To Learn Dimension Reduction Techniques“.
Now, let’s look at the 4 most commonly used gradient boosting algorithms.
GBM
GBM is a boosting algorithm used when we deal with plenty of data and want a prediction with high predictive power. Boosting is an ensemble learning technique that combines the predictions of several base estimators in order to improve robustness over a single estimator. It combines multiple weak or average predictors to build a strong predictor. Boosting algorithms often work well in data science competitions like Kaggle, AV Hackathon, and CrowdAnalytix.
More: Know about Boosting algorithms in detail
Python Code:
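As a rough sketch of how weak predictors are combined, the code below implements gradient boosting for squared-error regression on a single feature, using one-split decision stumps as the weak learners. It is an illustration in plain Python under simplifying assumptions (one numeric feature with at least two distinct values, fixed learning rate), not the implementation used by any particular library; all names are hypothetical.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        l_mean, r_mean = mean(left), mean(right)
        error = (sum((r - l_mean) ** 2 for r in left)
                 + sum((r - r_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, l_mean, r_mean)
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x <= threshold else r_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = mean(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Hypothetical data with a jump between x = 3 and x = 4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

Each stump on its own is a weak predictor, but because every round corrects the errors left by the previous rounds, the combined model fits the training data much more closely than the starting mean.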
R Code:
library(caret)
x <- cbind(x_train,y_train)
# Fitting model
fitControl <- trainControl( method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl,verbose = FALSE)
predicted= predict(fit,x_test,type= "prob")[,2]
GradientBoostingClassifier and Random Forest are two different tree-based ensemble classifiers (the first uses boosting, the second bagging), and people often ask about the difference between these two algorithms.
XGBoost
Another classic gradient-boosting algorithm that’s known to be the decisive choice between winning and losing in some Kaggle competitions is XGBoost. It has very high predictive power, which makes it a popular choice when accuracy matters most. It offers both a linear model and a tree learning algorithm, and it is reported to be almost 10x faster than earlier gradient boosting techniques.
One of the most interesting things about XGBoost is that it is also called a regularized boosting technique. The regularization helps to reduce overfitting. XGBoost also has broad support for a range of languages such as Scala, Java, R, Python, Julia, and C++.
It supports various objective functions, including regression, classification, and ranking. It also supports distributed training on many machines, including GCE, AWS, Azure, and Yarn clusters. XGBoost can also be integrated with Spark, Flink, and other cloud dataflow systems, and it has built-in cross-validation at each iteration of the boosting process.
R Code:
require(caret)
x <- cbind(x_train,y_train)
# Fitting model
TrainControl <- trainControl( method = "repeatedcv", number = 10, repeats = 4)
model<- train(y_train ~ ., data = x, method = "xgbLinear", trControl = TrainControl,verbose = FALSE)
# OR
model<- train(y_train ~ ., data = x, method = "xgbTree", trControl = TrainControl,verbose = FALSE)
predicted <- predict(model, x_test)
LightGBM
LightGBM is a gradient-boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:
Faster training speed and higher efficiency
Lower memory usage
Better accuracy
Parallel and GPU learning supported
Capable of handling large-scale data
LightGBM is a fast, high-performance gradient-boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine-learning tasks. It was developed under the Distributed Machine Learning Toolkit Project of Microsoft.
Since LightGBM is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise. So when growing on the same leaf node in LightGBM, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, which often results in better accuracy than other existing boosting algorithms achieve.
Also, it is surprisingly very fast, hence the word ‘Light.’
Python Code:
import numpy as np
import lightgbm as lgb

# Synthetic training data: 500 entities, each with 10 features
data = np.random.rand(500, 10)
label = np.random.randint(2, size=500)  # binary target
train_data = lgb.Dataset(data, label=label)
# Synthetic validation data, aligned with the training set
valid_x = np.random.rand(100, 10)
valid_y = np.random.randint(2, size=100)
test_data = lgb.Dataset(valid_x, label=valid_y, reference=train_data)
param = {'num_leaves': 31, 'objective': 'binary', 'metric': 'auc'}
num_round = 10  # number of boosting iterations
bst = lgb.train(param, train_data, num_boost_round=num_round, valid_sets=[test_data])
bst.save_model('model.txt')
# Predict for 7 new entities, each with 10 features
data = np.random.rand(7, 10)
ypred = bst.predict(data)
R Code:
library(RLightGBM)
data(example.binary)
#Parameters
num_iterations <- 100
config <- list(objective = "binary", metric="binary_logloss,auc", learning_rate = 0.1, num_leaves = 63, tree_learner = "serial", feature_fraction = 0.8, bagging_freq = 5, bagging_fraction = 0.8, min_data_in_leaf = 50, min_sum_hessian_in_leaf = 5.0)
#Create data handle and booster
handle.data <- lgbm.data.create(x)
lgbm.data.setField(handle.data, "label", y)
handle.booster <- lgbm.booster.create(handle.data, lapply(config, as.character))
#Train for num_iterations iterations and eval every 5 steps
lgbm.booster.train(handle.booster, num_iterations, 5)
#Predict
pred <- lgbm.booster.predict(handle.booster, x.test)
#Test accuracy
sum(y.test == (pred > 0.5)) / length(y.test)
#Save model (can be loaded again via lgbm.booster.load(filename))
lgbm.booster.save(handle.booster, filename = "/tmp/model.txt")
If you’re familiar with the caret package in R, this is another way of implementing LightGBM.
require(caret)
require(RLightGBM)
data(iris)
model <-caretModel.LGBM()
fit <- train(Species ~ ., data = iris, method=model, verbosity = 0)
print(fit)
y.pred <- predict(fit, iris[,1:4])
library(Matrix)
model.sparse <- caretModel.LGBM.sparse()
#Generate a sparse matrix
mat <- Matrix(as.matrix(iris[,1:4]), sparse = T)
fit <- train(data.frame(idx = 1:nrow(iris)), iris$Species, method = model.sparse, matrix = mat, verbosity = 0)
print(fit)
CatBoost
CatBoost is an open-source machine learning algorithm from Yandex. It can easily integrate with deep learning frameworks like Google’s TensorFlow and Apple’s Core ML. The best part about CatBoost is that it does not require extensive data training like other ML models, and it works with a variety of data formats without undermining its robustness.
CatBoost can automatically handle categorical variables without raising type conversion errors, which lets you focus on tuning your model rather than sorting out trivial errors. Make sure you handle missing data before you proceed with the implementation.
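One idea behind CatBoost's handling of categorical variables is to replace each category with a target statistic computed only from earlier rows, so that a row's own target does not leak into its encoding. The following is a simplified pure-Python sketch of that idea; the function name `ordered_target_stats` and the `prior` and `weight` values are assumptions for illustration, and the library itself also relies on random permutations of the rows and other refinements.

```python
def ordered_target_stats(categories, targets, prior=0.5, weight=1.0):
    """Encode each row using only the targets of earlier rows with the same category."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)   # sum of targets seen so far for this category
        n = counts.get(cat, 0)   # number of earlier rows with this category
        # Smoothed mean: falls back to the prior when the category is new
        encoded.append((s + weight * prior) / (n + weight))
        # Only now add the current row's target to the running statistics
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 0, 1, 1]
enc = ordered_target_stats(cats, ys)
print(enc)  # [0.5, 0.5, 0.75, 0.5, 0.25]
```

The first occurrence of each category receives the prior, and later occurrences move toward the mean target observed so far for that category.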
Python Code:
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
#Read training and testing files
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
#Imputing missing values for both train and test
train.fillna(-999, inplace=True)
test.fillna(-999,inplace=True)
#Creating a training set for modeling and validation set to check model performance
X = train.drop(['Item_Outlet_Sales'], axis=1)
y = train.Item_Outlet_Sales
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.7, random_state=1234)
# Indices of the non-float columns, which are passed to CatBoost as categorical features
categorical_features_indices = np.where(X.dtypes != float)[0]
# Building the model
model = CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_validation, y_validation), plot=True)
submission = pd.DataFrame()
submission['Item_Identifier'] = test['Item_Identifier']
submission['Outlet_Identifier'] = test['Outlet_Identifier']
submission['Item_Outlet_Sales'] = model.predict(test)
Now it’s time to take the plunge and work with some real-world datasets. Are you ready to take on the challenge?
Conclusion
By now, I am sure you have an idea of the commonly used machine learning algorithms. My sole intention behind writing this article and providing the code in R and Python is to get you started right away. If you are keen to master machine learning algorithms, take up problems, develop a hands-on understanding of the process, apply this code, and watch the fun!
Key Takeaways
We are now familiar with some of the most common ML algorithms used in the industry.
We’ve covered the advantages and disadvantages of various ML algorithms.
We’ve also learned the basic implementation details in R and Python.
Frequently Asked Questions
Q1. Which algorithm is most used in machine learning?
A. The suitable algorithm depends on the problem, but gradient-boosted decision trees are widely used because they balance performance and interpretability.
Q2. What is the difference between supervised and unsupervised ML?
A. In the supervised learning model, the labels associated with the features are given. In unsupervised learning, no labels are provided for the model.
Q3. What are the 3 main types of ML models?
A. The 3 main types of ML models are based on Supervised Learning, Unsupervised Learning, and Reinforcement Learning.