Pentaho Data Integration Fundamentals Training Course

  • Introduction
  • Installing and Configuring Pentaho
  • Overview of Pentaho Features and Architecture
  • Understanding Pentaho’s In-Memory Caching
  • Navigating the User Interface
  • Connecting to a Data Source
  • Configuring the Pentaho Enterprise Repository
  • Transforming Data
  • Viewing the Transformation Results
  • Resolving Transformation Errors
  • Processing a Data Stream
  • Reusing Transformations
  • Scheduling Transformations
  • Securing Pentaho
  • Integrating with Third-party Applications (Hadoop, NoSQL, etc.)
  • Analytics and Reporting
  • Pentaho Design Patterns and Best Practices
  • Troubleshooting

Tableau Advanced Training Course

Introduction and Getting Started

  1. Filtering, Sorting & Grouping
    1. Advanced options for filtering and hiding
    2. Understanding many options for ordering and grouping your data
    3. Sort, Groups, Bins, Sets
    4. Interrelation between all options
  2. Working with Data in Tableau
    1. Dimensions versus Measures
    2. Data types, Discrete versus Continuous
    3. Joining Database sources
    4. Inner, Left, Right join
    5. Blending different data sources in a single worksheet
    6. Working with extracts instead of live connections
    7. Data quality problems
    8. Metadata and sharing a connection
  3. Calculations on Data and Statistics
    1. Row-level calculations
    2. Aggregate calculations
    3. Arithmetic, string, date calculations
    4. Custom aggregations and calculated fields
    5. Control-flow calculations
    6. What is behind the scenes
    7. Advanced Statistics
    8. Working with dates and times
  4. Table Calculations
    1. Quick table calculations
    2. Scope and direction
    3. Addressing and partitioning
    4. Advanced table calculations
  5. Advanced Geo techniques
    1. Building basic maps
    2. Geographic fields, map options
    3. Customizing a geographic view
    4. Web Map Service
    5. Visualizing non-geographical data with background images
    6. Mapping tips
    7. Distance Calculations
  6. Parameters in Tableau
    1. Creating parameters
    2. Parameters in calculated fields
    3. Parameter control options
    4. Enhancing analysis and visualizations with parameters
  7. Building Advanced Chart Visualizations
    1. Bar chart variations: bullet, bar-in-bar, highlight charts
    2. Date and time visualizations, Gantt charts
    3. Stacked bars, treemaps, area charts, pie charts
    4. Heat map
    5. KPI chart
    6. Pareto chart
    7. Bullet chart
  8. Advanced formatting
    1. Labels
    2. Legends
    3. Highlighting
    4. Annotations
  9. Telling a data story with Dashboards
    1. Dashboard framework
    2. Filter actions
    3. Highlight actions
    4. URL actions
    5. Cascading filters
  10. Trends and Forecasting
    1. Understanding and Customizing trend lines
    2. Distributions
    3. Forecasting
  11. Integrating Tableau and R for advanced data analytics
    1. Different data analytics methods in R can be included on participants' request

R and Python coding with Prython

Learn how to use Prython for coding both R and Python projects

Design complex data science projects in Prython

Requirements

  • Know some Python and/or R
  • Basic knowledge about data science and analytics

Description

In this course we will learn how to use Prython, which offers a different way of coding from existing R/Python IDEs: it lets us drop our code into panels that we place and connect on a canvas. In a normal IDE your code runs linearly from start to end, making it hard to create sub-experiments or tests and to organise your project clearly. In Prython each panel accepts multiple IN and OUT connections, effectively turning the canvas into a 2D Jupyter notebook. Prython also has a wide array of tools that complement this canvas functionality, such as displaying dataframes next to the panels that modified them, freezing outputs, attaching consoles, and navigation markers.

We assume that the student is already familiar with R or Python, and some familiarity with matplotlib, scikit-learn, or Keras would be beneficial as well.

Who this course is for:

  • Python and R practitioners with a focus on data science
  • ML engineers
  • Statisticians, engineers, and economists designing statistical models

Course content

Introduction to Data Science using Python (Module 1/3)

Understand the basics of Data Science and Analytics

Understand how to use Python and scikit-learn

Get a good understanding of all buzz words like “Data Science”, “Machine learning”, “Data Scientist” etc.

Requirements

  • This course does not have any prerequisites. All you need is a Windows or Mac machine.

Description

Are you completely new to Data science?

Have you been hearing buzz words like Machine learning, Data Science, Data Scientist, Text analytics, and Statistics and don’t know what they mean?

Do you want to start or switch your career to Data Science and analytics?

If yes, then I have a new course for you. In this course, I cover the absolute basics of Data Science and Machine learning. This course will not cover in-depth algorithms. I have split this course into 3 Modules. This module takes a 500,000-ft. view of what Data Science is and how it is used. We will go through commonly used terms and write some code in Python. I spend some time walking you through the different career areas in the Business Intelligence stack, where Data Science fits in, what Data Science is, and what tools you will need to get started. I will be using Python and the scikit-learn package in this course. I am not assuming any prior knowledge in this area. I have given some reading materials, which will help you solidify the concepts discussed in these lectures.

This course is the first data science course in a series of courses. Consider this a 101-level course, where I don’t go too deep into any particular statistical area but rather cover just enough to raise your curiosity in the field of Data Science and Analytics.

The other modules will cover more complex concepts. 

Who this course is for:

  • Anyone who wants to learn about Data Science from scratch.
  • Anyone who wants to switch to or make a career in Data Science and Analytics
  • Anyone who is curious to know what Data Science is and what a Data Scientist does in his/her day job.

Course content

Simple Blogging Analytics Dashboard in Python

Understand the basics of web scraping

Understand how to set up a manual data pipeline

Learn how to modularize code into functions

See how to set up a basic dashboard in Flask

Requirements

  • Be able to program using Python
  • Understand how web development works

Description

This video series will walk through building a simple blogging analytics dashboard in Python.

Here is a synopsis of each video:

  1. Talks about the project and data pipeline
  2. Talks about web scraping basics
  3. Shows how to scrape one blog article
  4. Shows how to scrape all the blog articles in one category
  5. Shows how to scrape all the blog articles in all the categories
  6. Shows how to compute basic analytics
  7. Shows basic design and front-end development
  8. Shows how to set up a Flask server
  9. Shows how to deploy the app to Heroku
  10. Explores further improvements to the pipeline

Everything in the project is done manually to show the steps in between. I plan to upload a second version that shows how to automate the entire pipeline.
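
As a taste of what the series builds, here is a minimal, hypothetical sketch of the pipeline's core steps: scrape one article, compute a basic metric, and serve it from Flask. The URL and the HTML structure it assumes (an h1 title and p body paragraphs) are placeholders; a real blog needs its own selectors.

```python
# Minimal sketch: scrape one article, compute a word count, serve via Flask.
import requests
from bs4 import BeautifulSoup
from flask import Flask

app = Flask(__name__)

def scrape_article(url):
    """Fetch one article and return its title and body text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)           # assumes an <h1> title
    body = " ".join(p.get_text() for p in soup.find_all("p"))
    return title, body

def word_count(text):
    """The simplest possible 'analytics': a word count."""
    return len(text.split())

@app.route("/")
def dashboard():
    # Placeholder URL; point this at a real blog post to try it out.
    title, body = scrape_article("https://example.com/blog/some-post")
    return f"<h1>Blog analytics</h1><p>{title}: {word_count(body)} words</p>"

if __name__ == "__main__":
    app.run(debug=True)  # in deployment (e.g. Heroku), a WSGI server is used instead
```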

Who this course is for:

  • Beginner to advanced software developers interested in data engineering

Course content

Machine Learning and Higher Education

Software is eating the world, so said Marc Andreessen in 2011.1 These days it seems that machine learning and its specialized algorithms are eating the software world.2 Is it thus a foregone conclusion that machine learning will play a significant role in disrupting technology and shaping our future?

Machine learning concerns teaching machines to learn about something without explicit programming. At the core of machine learning is the idea of modeling and extracting useful information out of data. Societal trends clearly point to data as the resource of the future. Colleges and universities are already swimming in data, and there is much more on the way. Imagine a future in which computers are everywhere and interconnected with everything from clothes to refrigerators, phones, vending machines, and more. Some people have even proposed equipping toilets with sensors that collect data.3 Storing those data will be very cheap.4 These interconnected devices will produce quantities of data that are too large for human analysis, requiring us to teach computers to look for patterns in the data, identify predictor variables, and even try to make predictions from those variables.

Organizations that adapt and adopt machine learning will have a bright future. Machine learning is a new tool in the box, and it is worth learning how to use.5 Colleges, universities, and other educational institutions often adopt disruptive technologies in novel ways and are therefore in a good position to use machine learning to improve higher education. Adopting a machine learning–centric data-science approach as a tool for administrators and faculty could be a game changer for higher education.

Before we discuss machine learning further, it is important to briefly discuss analytics and traditional statistics. It is true that not all predictive analytics needs to be done with machine learning. The traditional methods here are statistical methods such as time series forecasting or various forms of regression. These have been used successfully in many fields for several years. In this article, from a very high overview, we refer to analytics as the subfield of machine learning that is predictive analytics and relies on training algorithms with a labeled training set, otherwise known as supervised learning. A common example is weather.6 Suppose we are interested in predicting sunny days. We can do this by observing our entire data set and feeding the conditions into an algorithm that looks at days that were sunny and days that were not. Once trained, the model can be fed new data and make guesses about whether a day will be sunny. For our purposes, we are interested in using supervised methods to make predictions and unsupervised methods such as clustering to find patterns in the data that we might not have seen.
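
To make the sunny-day example concrete, here is a toy sketch: train a classifier on labeled past days, then ask it about a new day. The features (humidity, pressure) and every number in it are invented for illustration.

```python
# Toy supervised learning: past days are labeled sunny or not, and the model
# learns to guess the label for a new, unlabeled day.
from sklearn.tree import DecisionTreeClassifier

# Each row is one past day: [humidity %, barometric pressure (hPa)]
conditions = [[30, 1025], [45, 1018], [85, 1002], [90, 998], [40, 1020]]
was_sunny = [1, 1, 0, 0, 1]  # the labels: 1 = sunny, 0 = not sunny

model = DecisionTreeClassifier().fit(conditions, was_sunny)
print(model.predict([[50, 1015]]))  # guess for a new day
```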

It is important to discuss the potential benefits and recommendations for pursuing machine learning as a tool for educational experts. In addition, it is important to note potential limitations and ethical considerations. Although an in-depth discussion is beyond the scope of this article, our hope is to start a conversation among higher education administrators, faculty, and IT specialists regarding the potential of machine learning to help make more-informed and better decisions — in other words, get people interested in machine learning to try it and see how things go. We are practicing what we advocate in this article. Heath Yates is actively exploring new algorithmic approaches to machine learning, while Craig Chamberlain is applying machine learning to data in higher education.

Potential Benefits of Machine Learning in Higher Education

Our interest in machine learning began by doing some very simple clustering analysis parallel to k-nearest neighbor (kNN). Techniques such as kNN can help analysts find patterns in larger data sets. During the 2016–17 year, Chamberlain was approached by his university to look at a question posed by a donor: “Can we identify a group of students who need an additional scholarship that would eventually lead to increased retention?” After spending time with several data sets and after a lot of research, Chamberlain and his team identified a group of students who needed additional money to remain enrolled. At the time, many believed that increasing retention for this group was a long shot. However, after awarding these students additional scholarships, retention rose from approximately 64% to about 90%. This effort has had two distinct benefits. The most important is that it contributed to the continued success of those students. The second is that it resulted in about $200,000 in additional net tuition revenue from an investment of about $50,000 in scholarships. By conducting basic machine learning to find patterns in the data and testing hypotheses, Chamberlain and his team were able to help students and the university. Although this use case is simple and nascent and relied on some traditional statistical inference, once machine learning and education begin interacting more often, this simple example can evolve into larger data sets and larger solutions.

Although analytics is relatively widespread, we believe higher education has barely scratched the surface of the potential for machine learning. At the same time, we do not mean to suggest that no one is doing this kind of work. Rather, we believe there is room to grow in this area. Because Chamberlain works as an analyst in higher education — specifically enrollment management — he has seen substantial market potential for data science and machine learning. From student recruitment and success to curricular modeling and student-to-faculty ratios, large quantities of data go unused. Across the country, only a few consultants are using data science to assess student recruitment and success, which often results in a one-size-fits-all approach to recruiting, awarding financial aid, and measuring student success. Each graduating high school senior has numerous data points to assess, including location, grades, and parent income. Machine learning can assess data for each student and determine the likelihood that the student will enroll. Once a student enrolls, even more data points can be assessed, such as the living situation, grade on the first calculus exam, and major. Using machine learning, universities can then home in on student retention and persistence and identify factors that influence student success.
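
A hypothetical sketch of the enrollment-likelihood idea follows, using the kinds of data points mentioned above (location, grades, parent income). All numbers, and the choice of model, are invented for illustration.

```python
# Hypothetical enrollment-likelihood model: fit on past applicants, then
# score a new one. Every value here is made up for illustration.
from sklearn.linear_model import LogisticRegression

# [miles from campus, high-school GPA, parent income in $1000s]
applicants = [[10, 3.8, 90], [250, 3.1, 45], [30, 3.5, 70],
              [400, 2.9, 55], [15, 3.9, 120], [300, 3.3, 40]]
enrolled = [1, 0, 1, 0, 1, 0]  # 1 = enrolled, 0 = did not enroll

model = LogisticRegression().fit(applicants, enrolled)
# Estimated probability that a new applicant enrolls:
print(model.predict_proba([[60, 3.6, 80]])[0][1])
```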

Machine learning could potentially be used to look for patterns on a campus-wide level. Are there conditional probabilities or cluster analyses that suggest a pattern for passing a statistics course? Suppose, for example, that students who earn high marks in math classes are more likely to pass a statistics course. This seems obvious, but machine learning can provide a methodology to confirm or refute this belief.

How could university leadership use this information to increase retention and student success? Consider, for example, a correlation between taking particular courses that are not prerequisites for statistics and doing well in statistics. Using machine learning in exploratory data analysis might help find these kinds of patterns.

Kansas City uses machine learning to prevent potholes before they even form.7 Colleges and universities could consider using machine learning as a preventative tool as well. If an institution maintains detailed records on IT purchases and equipment, machine learning could be applied to IT equipment maintenance or maintenance in general.

To make better business decisions, higher education is going to need machine learning, and people able to understand these algorithms. Currently, many universities do not have a chief data scientist or a team of experts to apply machine learning in an official capacity. Therefore, many universities are missing opportunities that machine learning provides. We suspect that the institutions that are using machine learning are not talking about it much, and we encourage them to reach out to us and others to share their successes and challenges.

Recommendations for Adopting Machine Learning

Getting started with machine learning is not as difficult as some might imagine or claim. Universities, colleges, and other educational institutions are in a good position to adopt, start, grow, and implement machine learning projects, given their access to faculty who have mathematical, statistical, and computer science backgrounds. We offer the following high-level recommendations on how to implement machine learning projects at the university level.

Set Clear Expectations of Institutional Needs, Goals, and Requirements

Administrators and faculty should brainstorm about institutional needs that machine learning can help address. Start small with a very narrow question. For example, it might be useful to predict who is most likely to pass a certain difficult class. Are there discernible patterns that can help predict which students will pass calculus? Can machine learning predict enrollment in specific classes? Conversely, are there patterns in institutional data that can help predict which students are likely to earn degrees by using clustering analysis of some type? Also, be sure to have a goal for the potential findings. How can the university use these results to improve students’ success, boost retention, and increase student enrollment?

Temper with Realism

Find out if a faculty member or other expert at your institution or nearby can offer an informed opinion about whether the questions being asked can be answered — can the problem be solved by machine learning? Some problems are easy and inexpensive to solve, and others are not. If not, consult with the expert and go back to the drawing board. Make sure you have individuals who can do the proposed work — typically someone with a mathematics, statistics, and programming background. Industry refers to individuals who possess this combination of skills as “magical unicorns.”8 This is where being in higher learning pays off — these talents are usually close by, if not in one person then definitely in a group of people. The challenge for administrators is to be the bridge that lets people cross traditional boundaries and to make sure those involved pursue this as an interdisciplinary effort on behalf of the institution.

Consider Finances

Can your institution afford to hire a full-time data scientist? In many cases, this might not be an option, given the salaries that such individuals command.9 A reasonable alternative is to put together a diverse, interdisciplinary team of volunteers who agree to do this work. The cost, so to speak, is whatever time the institution is willing to divert from those team members’ other activities. Also be aware that some investment in computing or storage may be required, although depending on how up-to-date your institution’s technology infrastructure is, this might not be necessary.

Realize That This Work Takes Time and Can Be Complicated

Being realistic about the overhead upfront tempers expectations. Your institution may have loads of historical data, but they might be in legacy systems, or there may be technical hurdles that make it difficult to easily access those data. The value is there, but it may take time to come up with a solution that is easy for everyone to use. Depending on the questions administrators or faculty ask, it may also take some time to do proper data analysis. The key is to be patient and strategic. If you commit to doing machine learning, play the long game.

Understand That Security and Privacy Are Paramount to Machine Learning

There are likely local, state, and federal guidelines and laws that educational institutions must adhere to in order to safeguard their data. Before moving further on machine learning, all data should be as secure as possible. In addition, the privacy of individuals must be protected. Most industries process, clean, and store data so that no individually discernible information can be gleaned from it.

Do Machine Learning

The combination of imaginative, creative, and capable people means that new applications, innovations, and benefits are being found very quickly. If you do not have a group of individuals who can currently do machine learning, then find people who are interested and invest in them to do it. There are many resources in online learning and education to teach data science. Many of these resources are free, and some offer certifications at a reasonable price. The barrier to entry is lower than you might think. Many of the programming and data science tools are free.

Ethical Considerations and Limitations

While we believe in the future of machine learning, it always pays to be cautious when adopting new technology. Machine learning is powerful. As the saying goes, with great power comes great responsibility. Often, in the excitement, it can be easy to lose sight of the downsides of a new technology or tool. Machine learning provides analysts and decision makers with previously undreamed-of powers due to its ability to find patterns, make predictions, and draw inferences. The examples below can serve as cautionary tales and motivate questions regarding ethical considerations and the potential limitations of machine learning. In other words, just because we can do something does not imply or suggest we must. Any machine learning project should respect the institution’s policies and mission.

Respect Privacy

One of the earliest examples of using machine learning in predictive analytics came from an incident in which Target sent coupons to a woman it determined was likely to be pregnant. The story goes something like this: The indignant father went to Target and complained to management that it had sent coupons addressed to his teenage daughter with advertisements for maternity clothing and baby furniture. The store management apologized, but the father later contacted them and offered his own apology when he learned his daughter indeed was pregnant.10 How could machine learning techniques determine that a woman was pregnant? The short answer is that we are creatures of habit. Therefore, human behavior and patterns in data collected by companies can be used to identify emerging trends, such as pregnancy.11 The open question here is whether we always should.

Consider the Implications

More recently, machine-learning techniques were successfully used to infer the sexual orientation of individuals based on their facial features using data from dating websites. Specifically, two researchers from Stanford University trained an AI system to detect patterns in facial features and used it to identify the sexual orientation of a random man (with 81% accuracy) and a random woman (with 71% accuracy).12 This is much higher than the reported capability of humans. This research has generated questions of whether such capabilities mean it is also possible to infer a person’s political orientation and IQ from their appearance.13

Currently, these are cutting-edge findings and research. Frankly, we are somewhat skeptical of the findings and see them as a new form of pseudoscience. We present them as likely examples of use cases where machine learning should not have been applied to begin with. Machine learning should serve the mission of higher education to reduce bias and prejudice in human society, not potentially promote it.

Insist on Appropriate Goals

Some controversial work has appeared in detecting criminality based on facial features. Researchers demonstrated that it was possible to infer criminality based solely on still face images using common machine learning techniques.14 Academia and media alike harshly criticized the findings as a new form of craniometrics and pseudoscience.15 One risk of data science is creating hard-to-understand artificial intelligence systems built on questionable or pseudoscientific ideas.

These examples lead to one invariable and fundamental conclusion regarding the ethical implications of machine learning: We must be careful that machine learning is not abused, resulting in either intentional or unintentional biases or exclusionary analyses, predictions, and artificial intelligence systems. Some nascent research demonstrates that this is possible — research has shown that political leanings can influence how an artificial intelligence system might pick synonyms for political hashtags.16

These examples are not intended to create fear or dissuade readers from pursuing machine learning. Rather, we hope to generate a positive discussion about machine learning and how it can be carefully, responsibly, and maturely applied. Colleges, universities, and other educational institutions should define clear standards so that machine learning projects do not violate ethical norms and stay true to institutional goals. In fact, this is an opportunity for higher education to lead society by doing things the right way. One way to address many of these issues is to recruit a diverse, inclusive team of experts to analyze data carefully in an ethical and sound way. This is an easy and natural strength for universities, colleges, and other academic organizations.

Conclusion

Machine learning shows great potential to disrupt how we process and consume data and use software. Serious ethical questions and limitations must be considered. However, higher education is naturally and uniquely positioned to capitalize on the promise of machine learning by using it as a tool for social and moral good. Higher education has the opportunity not only to use machine learning to help transform itself to make better decisions but also to explore how it might apply machine learning as a force for good. How can machine learning relate to and benefit higher education? Considering the trend toward automation in technology as a guide, we believe that the answer, ultimately, is in everything.

What is machine learning? Intelligence derived from data

Machine learning algorithms learn from data to solve problems that are too complex to solve with conventional programming

Machine learning defined

Machine learning is a branch of artificial intelligence that includes methods, or algorithms, for automatically creating models from data. Unlike a system that performs a task by following explicit rules, a machine learning system learns from experience. Whereas a rule-based system will perform a task the same way every time (for better or worse), the performance of a machine learning system can be improved through training, by exposing the algorithm to more data.

Machine learning algorithms are often divided into supervised (the training data are tagged with the answers) and unsupervised (any labels that may exist are not shown to the training algorithm). Supervised machine learning problems are further divided into classification (predicting non-numeric answers, such as the probability of a missed mortgage payment) and regression (predicting numeric answers, such as the number of widgets that will sell next month in your Manhattan store).

Unsupervised learning is further divided into clustering (finding groups of similar objects, such as running shoes, walking shoes, and dress shoes), association (finding common sequences of objects, such as coffee and cream), and dimensionality reduction (projection, feature selection, and feature extraction).

Applications of machine learning

We hear about applications of machine learning on a daily basis, although not all of them are unalloyed successes. Self-driving cars are a good example, where tasks range from simple and successful (parking assist and highway lane following) to complex and iffy (full vehicle control in urban settings, which has led to several deaths).

Game-playing machine learning is strongly successful for checkers, chess, shogi, and Go, having beaten human world champions. Automatic language translation has been largely successful, although some language pairs work better than others, and many automatic translations can still be improved by human translators.

Automatic speech to text works fairly well for people with mainstream accents, but not so well for people with some strong regional or national accents; performance depends on the training sets used by the vendors. Automatic sentiment analysis of social media has a reasonably good success rate, probably because the training sets (e.g. Amazon product ratings, which couple a comment with a numerical score) are large and easy to access.

Automatic screening of résumés is a controversial area. Amazon had to withdraw its internal system because of training sample biases that caused it to downgrade all job applications from women.

Other résumé screening systems currently in use may have training biases that cause them to upgrade candidates who are “like” current employees in ways that legally aren’t supposed to matter (e.g. young, white, male candidates from upscale English-speaking neighborhoods who played team sports are more likely to pass the screening). Research efforts by Microsoft and others focus on eliminating implicit biases in machine learning.

Automatic classification of pathology and radiology images has advanced to the point where it can assist (but not replace) pathologists and radiologists for the detection of certain kinds of abnormalities. Meanwhile, facial identification systems are both controversial when they work well (because of privacy considerations) and tend not to be as accurate for women and people of color as they are for white males (because of biases in the training population).

Machine learning algorithms

Machine learning depends on a number of algorithms for turning a data set into a model. Which algorithm works best depends on the kind of problem you’re solving, the computing resources available, and the nature of the data. No matter what algorithm or algorithms you use, you’ll first need to clean and condition the data.

Let’s discuss the most common algorithms for each kind of problem.

Classification algorithms

A classification problem is a supervised learning problem that asks for a choice between two or more classes, usually providing probabilities for each class. Leaving out neural networks and deep learning, which require a much higher level of computing resources, the most common algorithms are Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbors, and Support Vector Machine (SVM). You can also use ensemble methods (combinations of models), such as Random Forest, other Bagging methods, and boosting methods such as AdaBoost and XGBoost.
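
A minimal sketch comparing two of the classifiers named above on scikit-learn's built-in iris data set; any of the other listed algorithms can be swapped in the same way.

```python
# Fit two classifiers on the same split and compare their test accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (GaussianNB(), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))  # accuracy
    # model.predict_proba(X_test) gives the per-class probabilities
```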

Regression algorithms

A regression problem is a supervised learning problem that asks the model to predict a number. The simplest and fastest algorithm is linear (least squares) regression, but you shouldn’t stop there, because it often gives you a mediocre result. Other common machine learning regression algorithms (short of neural networks) include Naive Bayes, Decision Tree, K-Nearest Neighbors, LVQ (Learning Vector Quantization), LARS Lasso, Elastic Net, Random Forest, AdaBoost, and XGBoost. You’ll notice that there is some overlap between machine learning algorithms for regression and classification.
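
The same pattern works for regression. Here is a sketch that starts with linear regression, as recommended, and compares it against one of the listed alternatives on a built-in data set:

```python
# Compare the linear baseline with a random forest; score() reports R^2.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))
```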

Clustering algorithms

A clustering problem is an unsupervised learning problem that asks the model to find groups of similar data points. The most popular algorithm is K-Means Clustering; others include Mean-Shift Clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), GMM (Gaussian Mixture Models), and HAC (Hierarchical Agglomerative Clustering).
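
A short K-Means sketch on synthetic data; the model is told only how many groups to look for, never which point belongs to which group:

```python
# Unsupervised grouping: K-Means finds three clusters in unlabeled points.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # labels unused
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # the three group centers it found
print(kmeans.labels_[:10])      # cluster assignment for the first 10 points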

Dimensionality reduction algorithms

Dimensionality reduction is an unsupervised learning problem that asks the model to drop or combine variables that have little or no effect on the result. This is often used in combination with classification or regression. Dimensionality reduction algorithms include removing variables with many missing values, removing variables with low variance, Decision Tree, Random Forest, removing or combining variables with high correlation, Backward Feature Elimination, Forward Feature Selection, Factor Analysis, and PCA (Principal Component Analysis).
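
A brief PCA sketch: the four iris features are projected down to two components, and the explained variance ratio shows how much of the original information the kept components carry:

```python
# Project 4-dimensional iris data onto its 2 strongest principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # shape goes from (150, 4) to (150, 2)
print(pca.explained_variance_ratio_)  # fraction of variance each component keeps
```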

Optimization methods

Training and evaluation turn supervised learning algorithms into models by optimizing their parameter weights to find the set of values that best matches the ground truth of your data. The algorithms often rely on variants of steepest descent for their optimizers, for example stochastic gradient descent (SGD), which is essentially steepest descent performed multiple times from randomized starting points.

Common refinements on SGD add factors that correct the direction of the gradient based on momentum, or adjust the learning rate based on progress from one pass through the data (called an epoch) to the next.
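
A bare-bones illustration of plain SGD (no momentum or learning-rate schedule), fitting y = w*x + b to noisy synthetic data using one randomly chosen sample per step:

```python
# Stochastic gradient descent from scratch on a one-variable linear model.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 200)  # ground truth: w = 3, b = 1

w, b, lr = 0.0, 0.0, 0.1
for step in range(2000):
    i = rng.integers(len(x))       # pick one sample at random (the "stochastic" part)
    err = (w * x[i] + b) - y[i]    # prediction error on that sample
    w -= lr * err * x[i]           # step against the squared-error gradient for w
    b -= lr * err                  # and for b

print(round(w, 2), round(b, 2))    # approaches 3.0 and 1.0
```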

Neural networks and deep learning

Neural networks were inspired by the architecture of the biological visual cortex. Deep learning is a set of techniques for learning in neural networks that involves a large number of “hidden” layers to identify features. Hidden layers come between the input and output layers. Each layer is made up of artificial neurons, often with sigmoid or ReLU (Rectified Linear Unit) activation functions.

In a feed-forward network, the neurons are organized into distinct layers: one input layer, any number of hidden processing layers, and one output layer, and the outputs from each layer go only to the next layer.

In a feed-forward network with shortcut connections, some connections can jump over one or more intermediate layers. In recurrent neural networks, neurons can influence themselves, either directly, or indirectly through the next layer.

Supervised learning of a neural network is done just like any other machine learning: You present the network with groups of training data, compare the network output with the desired output, generate an error vector, and apply corrections to the network based on the error vector, usually using a backpropagation algorithm. Groups of training data that are run together before applying corrections are called batches; one full pass through all of the training data is called an epoch.

As with all machine learning, you need to check the predictions of the neural network against a separate test data set. Without doing that you risk creating neural networks that only memorize their inputs instead of learning to be generalized predictors.

The breakthrough in the neural network field for vision was Yann LeCun’s 1998 LeNet-5, a seven-level convolutional neural network (CNN) for recognition of handwritten digits digitized in 32×32 pixel images. To analyze higher-resolution images, the network would need more neurons and more layers.

Convolutional neural networks typically use convolutional, pooling, ReLU, fully connected, and loss layers to simulate a visual cortex. The convolutional layer basically takes the integrals of many small overlapping regions. The pooling layer performs a form of non-linear down-sampling. ReLU layers, which I mentioned earlier, apply the non-saturating activation function f(x) = max(0,x).

In a fully connected layer, the neurons have full connections to all activations in the previous layer. A loss layer computes how the network training penalizes the deviation between the predicted and true labels, using a Softmax or cross-entropy loss for classification or a Euclidean loss for regression.
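
As a sketch of how those layer types stack, here is a compact LeNet-style CNN in Keras (assuming TensorFlow is installed); the layer sizes are illustrative rather than an exact reconstruction of LeNet-5:

```python
# A small convolutional network: convolution, pooling, ReLU, fully connected
# layers, and a softmax output trained with a cross-entropy loss.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),           # e.g. grayscale digit images
    layers.Conv2D(6, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),          # non-linear down-sampling
    layers.Conv2D(16, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation="relu"),      # fully connected layer
    layers.Dense(10, activation="softmax"),    # one output per digit class
])
# Cross-entropy is the classification loss; SGD ties back to the optimizers
# discussed in the previous section.
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```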

Natural language processing (NLP) is another major application area for deep learning. In addition to the machine translation problem addressed by Google Translate, major NLP tasks include automatic summarization, co-reference resolution, discourse analysis, morphological segmentation, named entity recognition, natural language generation, natural language understanding, part-of-speech tagging, sentiment analysis, and speech recognition.

In addition to CNNs, NLP tasks are often addressed with recurrent neural networks (RNNs), which include the Long Short-Term Memory (LSTM) model.

The more layers there are in a deep neural network, the more computation it takes to train the model on a CPU. Hardware accelerators for neural networks include GPUs, TPUs, and FPGAs.

Reinforcement learning

Reinforcement learning trains an actor or agent to respond to an environment in a way that maximizes some value, usually by trial and error. That’s different from supervised and unsupervised learning, but is often combined with them.

For example, DeepMind’s AlphaGo, in order to learn to play (the action) the game of Go (the environment), first learned to mimic human Go players from a large data set of historical games (apprenticeship learning). It then improved its play by trial and error (reinforcement learning), by playing large numbers of Go games against independent instances of itself.

Robotic control is another problem that has been attacked with deep reinforcement learning methods, meaning reinforcement learning plus deep neural networks, the deep neural networks often being CNNs trained to extract features from video frames.

How to use machine learning

How does one go about creating a machine learning model? You start by cleaning and conditioning the data, continue with feature engineering, and then try every machine-learning algorithm that makes sense. For certain classes of problem, such as vision and natural language processing, the algorithms that are likely to work involve deep learning.

Data cleaning for machine learning

There is no such thing as clean data in the wild. To be useful for machine learning, data must be aggressively filtered. For example, you’ll want to:

  1. Look at the data and exclude any columns that have a lot of missing data.
  2. Look at the data again and pick the columns you want to use (feature selection) for your prediction. This is something you may want to vary when you iterate.
  3. Exclude any rows that still have missing data in the remaining columns.
  4. Correct obvious typos and merge equivalent answers. For example, U.S., US, USA, and America should be merged into a single category.
  5. Exclude rows that have data that is out of range. For example, if you’re analyzing taxi trips within New York City, you’ll want to filter out rows with pickup or drop-off latitudes and longitudes that are outside the bounding box of the metropolitan area.
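
Here is how the five steps above might look with pandas; the file name, column names, and bounding box are hypothetical placeholders for a taxi-trip-style data set:

```python
# The five cleaning steps applied with pandas (placeholder names throughout).
import pandas as pd

df = pd.read_csv("trips.csv")

# 1. Drop columns that are mostly missing (here: more than half empty).
df = df.loc[:, df.isna().mean() < 0.5]

# 2. Feature selection: keep only the columns used for prediction.
df = df[["country", "pickup_lat", "pickup_lon", "fare"]]

# 3. Drop rows that still have missing values in the remaining columns.
df = df.dropna()

# 4. Merge equivalent spellings into one category.
df["country"] = df["country"].replace(
    {"U.S.": "USA", "US": "USA", "America": "USA"})

# 5. Exclude out-of-range rows (a rough New York City bounding box).
df = df[df["pickup_lat"].between(40.4, 41.0)
        & df["pickup_lon"].between(-74.3, -73.6)]
```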

There is a lot more you can do, but it will depend on the data collected. This can be tedious, but if you set up a data-cleaning step in your machine learning pipeline you can modify and repeat it at will.

Data encoding and normalization for machine learning

To use categorical data for machine classification, you need to encode the text labels into another form. There are two common encodings.

One is label encoding, which means that each text label value is replaced with a number. The other is one-hot encoding, which means that each text label value is turned into a column with a binary value (1 or 0). Most machine learning frameworks have functions that do the conversion for you. In general, one-hot encoding is preferred, as label encoding can sometimes confuse the machine learning algorithm into thinking that the encoded column is ordered.
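
A short sketch showing both encodings applied to the same toy column with pandas:

```python
# Label encoding versus one-hot encoding of one categorical column.
import pandas as pd

df = pd.DataFrame({"shoe": ["running", "walking", "dress", "walking"]})

# Label encoding: each text value becomes an integer code.
df["shoe_label"] = df["shoe"].astype("category").cat.codes

# One-hot encoding: each text value becomes its own 1/0 column. Generally
# preferred, since the integer codes above imply an ordering that isn't real.
one_hot = pd.get_dummies(df["shoe"], prefix="shoe")
print(df.join(one_hot))
```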