Overview of Supervised Machine Learning Algorithms

Blog: Think Data Analytics Blog

How the big picture gives us insights and a better understanding of ML by connecting the dots

There are so many machine learning algorithms out there, and we can find different kinds of overviews and cheat sheets. Why another overview?

I tried to build this different overview with these three main focuses in mind:

When trying to connect the dots among the multitude of machine learning algorithms, I discovered that there are often several approaches to building or understanding an algorithm. For example:

No single approach is best: each one lets us understand one aspect, and one can be easier than another depending on what you already know. My aim in this article is to find a meaningful way to connect them. Doing so gave me many interesting insights that helped me understand the algorithms better. This overview also becomes a tool, a framework for going deeper into each algorithm.

Each discussion with fellow data scientists allows me to discover more. I will continue to improve it, and your comments are very welcome.

The Big Picture

The idea of this first overview is to map the most common algorithms. Overall, they are organized in a hierarchical structure with three main categories. There are also links between algorithms in different categories, which we will discuss later.

Overview of supervised machine learning (Image by Author)

To better explain this big picture, we will go through the following themes step by step:

The intuition behind the algorithms

How to define supervised machine learning? To answer this question concretely, one usually comes up with a machine learning algorithm. But which one would you choose? Which one is the most intuitive to understand for beginners?

When deciding the order of the three main categories, I wanted it to reflect how closely each one matches human intuition. For this, let’s mention two classic machine learning problems: house price prediction and Titanic survivor prediction. When we ask people with no knowledge of machine learning (because the idea is to relate machine intelligence to human intelligence), the idea of neighbors comes most easily for house price prediction, and decision trees are the most intuitive for Titanic survivor prediction.

It turns out that applying a math function is the least intuitive way. We will also see that simple math functions do not perform well and complex functions are not that intuitive…

Nearest Neighbors

For house price prediction, the answer is often: let’s look at similar housing sold in the neighborhood!

The idea of distance is very intuitive because, to predict a value for a new observation, we can search for similar observations. And mathematically, similarity and neighborhood will be considered as the same thing. The distance that we intuitively perceive is the geometrical distance (in this case, between two houses). This distance can also be generalized for two vectors.
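To make the idea of a distance between two vectors concrete, here is a minimal sketch using the Euclidean distance. The two houses and their features (surface and number of rooms) are hypothetical values chosen only for illustration, and other distance measures could be used instead.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two numeric vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical houses described as (surface in square meters, number of rooms).
house_a = (100.0, 4.0)
house_b = (103.0, 8.0)
print(euclidean_distance(house_a, house_b))  # 5.0
```

Note that the surface and the number of rooms are on different scales here, which is one reason why feature scaling matters for distance-based algorithms.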

Another reason this algorithm is intuitive is that there is no need to train a model.

Although nearest-neighbors-based algorithms are not frequently used in practice, for several reasons, the idea is interesting, and we will be able to relate it to other algorithms.

Combination of Rules

When it comes to Titanic survival prediction, a decision tree, which is a combination of rules (if-else), is a very intuitive way to explain machine learning. And we often say that for certain business problems, before creating complex algorithms, we can first apply business rules.

We often ask people which rule should be applied first when predicting whether a Titanic passenger survived. Many people correctly come up with sex as the most important variable, and this is indeed what we find when we build a decision tree. We can see that when people have domain expertise (ladies first!), their intuition is usually consistent with what the algorithms find.
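A combination of rules can be written directly as nested if-else statements. The sketch below is only an illustration: the first rule on sex follows the intuition described above, while the second rule on passenger class and the returned labels are hypothetical and were not learned from the data.

```python
def predict_survival(sex, passenger_class):
    """Toy hand-written rule set in the shape of a small decision tree."""
    # Rule 1: split on sex, the variable most people name first.
    if sex == "female":
        # Rule 2 (hypothetical): refine the prediction with the ticket class.
        if passenger_class <= 2:
            return "survived"
        return "did not survive"
    return "did not survive"

print(predict_survival("female", 1))
print(predict_survival("male", 1))
```

A trained decision tree has the same structure; the difference is that the variables, thresholds, and order of the rules are chosen by the algorithm from the training data.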

Mathematical functions

In practice, the least intuitive way is to apply a math function. Unless you are an aficionado of linear regression.

All the different algorithms in this category are different mathematical functions (so the input must be numerical; we will come back to this later), and a loss function must then be defined.

In this case, it is particularly important to distinguish between the model function and the loss function. For example, the equation y=ax+b is the same for these algorithms: linear regression, ridge, lasso, elastic net, and (linear) SVM. The difference is the loss function that is used for optimization.
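As a rough sketch of this distinction, the code below keeps one model function, y = ax + b, and defines several loss functions on top of it. The function names and the penalty weight `lam` are my own choices for illustration, and scaling conventions (sums versus means, factors of 1/2) vary between textbooks and libraries. For the SVM loss, the labels are assumed to be -1 or +1.

```python
def predict(a, b, x):
    """The model function shared by all the algorithms below."""
    return a * x + b

def squared_error(a, b, xs, ys):
    """Loss of ordinary linear regression (sum of squared residuals)."""
    return sum((y - predict(a, b, x)) ** 2 for x, y in zip(xs, ys))

def ridge_loss(a, b, xs, ys, lam):
    """Squared error plus an L2 penalty on the coefficient."""
    return squared_error(a, b, xs, ys) + lam * a ** 2

def lasso_loss(a, b, xs, ys, lam):
    """Squared error plus an L1 penalty on the coefficient."""
    return squared_error(a, b, xs, ys) + lam * abs(a)

def svm_loss(a, b, xs, ys, lam):
    """Hinge loss plus an L2 penalty; labels ys are assumed to be -1 or +1."""
    hinge = sum(max(0.0, 1.0 - y * predict(a, b, x)) for x, y in zip(xs, ys))
    return hinge + lam * a ** 2
```

Elastic net would combine the two penalties of ridge and lasso. In every case, training means searching for the a and b that minimize the chosen loss.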

The intuition behind supervised machine learning algorithms (Image by Author)

Model training and usage

Let’s first define some keywords:

Parameter and hyperparameters

Parameters vs hyperparameters for supervised machine learning algorithms (Image by Author)

Model implementation

When creating the model, we are focused on the training process, but it is also important to understand how the model will be implemented. To better understand, we can imagine that you have to implement them in Excel.

Regression and classification

When talking about supervised learning, in many overviews, we often see two sub-categories: regression and classification. As a reminder, a regression problem is when the target variable is continuous whereas a classification task is when the target variable is categorical.

I chose not to present them as two subcategories of supervised learning algorithms because the same algorithm can often work for either task. Let’s also look at how the target variable is used in the training process.

For nearest neighbors algorithms, the target variable is not used for the training process, because only the predictors are used to find the neighbors. So the nature of the target value has no impact on the training process. When the neighbors are found: if the target variable is numerical, then the average value of the neighbors is used for the prediction; if the target variable is categorical (with two or more categories), the proportion of the classes is used for prediction. And the proportion can be considered as the probability. For binary classification or multiclass classification, there is no difference.
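The prediction step for nearest neighbors can be sketched in a few lines. This is a minimal illustration with a brute-force search and Euclidean distance, not an efficient implementation; the function names are my own.

```python
import math
from collections import Counter

def nearest_neighbors(train_X, query, k):
    """Indices of the k training rows closest to the query point."""
    distances = [(math.dist(row, query), i) for i, row in enumerate(train_X)]
    return [i for _, i in sorted(distances)[:k]]

def knn_regress(train_X, train_y, query, k):
    """Numerical target: average the target values of the neighbors."""
    idx = nearest_neighbors(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)

def knn_class_proportions(train_X, train_y, query, k):
    """Categorical target: proportion of each class among the neighbors."""
    idx = nearest_neighbors(train_X, query, k)
    counts = Counter(train_y[i] for i in idx)
    return {label: n / len(idx) for label, n in counts.items()}
```

The target values only appear after the neighbors have been found, and the classification function works unchanged whether there are two classes or more.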

For decision tree-based models, the target variable is used to create splits or rules that make it homogeneous in the leaves. The structure of the model is the same for both tasks. To make a prediction, we first determine which leaf the new observation falls into. Then: for regression, we predict with the average value of the observations in the leaf; for classification, we calculate the proportion of each class. Again, there is no difference between binary and multiclass classification.

For mathematical function models, the target variable is used in a cost function that should be minimized to find the coefficients for each predictor. For categorical variables, one-hot encoding is used. Since a mathematical function takes only numerical values as input (we will see this later) and also outputs only numerical values, we would expect a priori that these models only work for regression tasks. So a regression problem poses no difficulty. What happens for classification tasks? There are three main solutions, and they generalize differently from binary classification to multiclass classification.
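The two prediction steps for a tree (find the leaf, then summarize the targets in it) can be sketched as follows. The tree below is a hypothetical one with a single split on a feature named "x"; its threshold and the target values stored in the leaves are made up for illustration.

```python
from collections import Counter

# Hypothetical tree: internal nodes hold a rule, leaves hold training targets.
tree = {
    "feature": "x",
    "threshold": 2.0,
    "left": {"targets": [1, 1, 0]},
    "right": {"targets": [0, 0, 0, 1]},
}

def find_leaf(node, observation):
    """Follow the rules from the root until a leaf is reached."""
    while "targets" not in node:
        go_left = observation[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node

def predict_mean(node, observation):
    """Regression: average target value in the leaf."""
    targets = find_leaf(node, observation)["targets"]
    return sum(targets) / len(targets)

def predict_proportions(node, observation):
    """Classification: proportion of each class in the leaf."""
    targets = find_leaf(node, observation)["targets"]
    return {label: n / len(targets) for label, n in Counter(targets).items()}
```

In a real implementation, only the summary (the mean or the proportions) needs to be stored in each leaf, not the training targets themselves.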
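One-hot encoding turns a categorical variable into numerical columns, one per category, so that a mathematical function can use it. Here is a minimal sketch; the function name is my own, and the example values are hypothetical.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, rows

categories, rows = one_hot_encode(["S", "C", "S", "Q"])
print(categories)  # ['C', 'Q', 'S']
print(rows)        # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```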

Regression and classification (Image by Author)

Feature variable handling

After we considered how the target variable is used in these algorithms, we can now consider feature variables.

Numerical variable and categorical variable

Effect of numerical variable scaling

Feature variables handling (Image by Author)

Missing value handling

Feature importance

Feature importance can be model-agnostic. Here we only consider how the algorithm itself can directly give insights into how the features are used, and therefore into their importance in the prediction process.

Missing values and feature importance (Image by Author)

Model enhancement and relationships with other algorithms

After all the previous analysis, we now have some idea of the advantages and drawbacks of the basic algorithm in each of the three categories. Since the shortcomings are different, the ways of improving the models are different. And that is where we can also relate algorithms from the three different categories.

Nearest neighbors models

We can improve the nearest neighbors in the following ways:

What else? Is there a relationship between Nearest Neighbors and Decision Trees? We can say that the leaves of decision trees contain neighbors. There is no distance used, but neighbors are found in another way, with rules. So you don’t have to store real neighbors to do predictions, you only need to store rules that help you to find your neighbors. And you don’t need them once they are found because only the prediction is needed and it can be stored.

Nearest Neighbours models enhancement and relationships with other algorithms (Image by Author)

Decision tree-based models

For decision trees, the main approach to improving the model is aggregating trees. And aggregation/addition is a mathematical function. So we can relate Boosting to GAM (Generalized Additive Model). It is a very powerful idea because, to optimize the aggregation of trees in a Gradient Boosting Machine, gradient descent is used. (Here is an interesting discussion) Different loss functions can also be chosen to suit different tasks.

It is worth noting that Boosting can be used with all sorts of base algorithms, but for linear regression, a sum of linear regressions is still linear (here and here we can find some interesting discussions). By contrast, one interesting characteristic of the decision tree is that it is non-linear by default (here).
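To illustrate how trees are added together, here is a minimal sketch of gradient boosting for regression with the squared error loss, using one-split trees (stumps) on a single feature. With this loss, the negative gradient is simply the residual, so each new stump is fitted to the current residuals. The function names, the number of rounds, and the learning rate are assumptions of mine, and the data must contain at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Best one-split regression tree on a single feature (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=100, learning_rate=0.1):
    """Additive model: mean of ys plus a scaled sum of stumps."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # Residuals are the negative gradient of the squared error loss.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

model = gradient_boost([1, 2, 3, 4], [1, 1, 5, 5])
print(model(1), model(4))
```

The final prediction is a sum of simple functions, which is where the link with additive models comes from.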
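A quick check shows why adding linear models gains nothing: the slopes and the intercepts simply add up, so the sum is again a single line. The coefficients below are arbitrary values chosen for illustration.

```python
def make_linear(a, b):
    """Return the function x -> a*x + b."""
    return lambda x: a * x + b

def add_models(models):
    """Additive ensemble: the prediction is the sum of all model outputs."""
    return lambda x: sum(model(x) for model in models)

ensemble = add_models([
    make_linear(2.0, 1.0),
    make_linear(-0.5, 3.0),
    make_linear(1.5, -2.0),
])
# Slopes: 2.0 - 0.5 + 1.5 = 3.0; intercepts: 1.0 + 3.0 - 2.0 = 2.0.
single = make_linear(3.0, 2.0)
print(ensemble(4.0), single(4.0))  # 14.0 14.0
```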

Another idea is to introduce some functions for the observations in the leaves. By default, the average value is computed for the regression task for example. Now the idea is to introduce more complex relationships, for example, a linear relationship. We can find some references here.

The same idea can be applied in reverse to linear regression: instead of creating a single linear relationship over the whole range of the predictors, we can cut the feature variables into different regions and fit a separate linear regression in each one.
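Here is a minimal sketch of this idea with one feature and one cut point: the data is split into two regions and an ordinary least squares line is fitted in each. The cut point is given by hand here, which is an assumption of the sketch; in practice it could come from a decision tree split. Each region needs at least two distinct x values.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x

def fit_piecewise(xs, ys, threshold):
    """Fit one line for x <= threshold and another for x > threshold."""
    left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    left_fit = fit_line(*zip(*left))
    right_fit = fit_line(*zip(*right))

    def predict(x):
        a, b = left_fit if x <= threshold else right_fit
        return a * x + b

    return predict

# Hypothetical data: rises up to x = 2, then falls.
model = fit_piecewise([0, 1, 2, 3, 4, 5], [0, 1, 2, 2, 1, 0], threshold=2)
print(model(1.5), model(4.5))  # 1.5 0.5
```

A single straight line could not follow this rise-then-fall shape, while two local lines fit it exactly.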

Decision tree based models enhancement (Image by Author)

Linear models enhancement

Let’s mention some cases where linear regression fails and how to improve it:

To create non-linear models, there are different approaches:

Animations of Neural Networks Transforming Data

And we can have a better intuition about why and how neural networks

Linear models enhancement (Image by Author)

For classification algorithms, in particular, we can mention more specificities.

Classification algorithms with math functions (Image by Author)

For the relationship between Linear regression and LDA, there is an interesting discussion here.

For LDA and logistic regression, I wrote an article to show that they are in fact very similar. They are the same model (with the same mathematical function; only the coefficients before the variables are different).

Intuitively, How Can We (Better) Understand Logistic Regression
Logistic Regression and Linear Discriminant Analysis are closely related. Here is an intuitive way to understand them…

For more details, you can read this article that I wrote about the transformation of linear classifiers to non-linear classifiers, with some visualizations.


When a general question is asked about all machine learning algorithms, I think it is important to bear in mind that the answer can differ depending on the algorithm. I hope that this overview can help to answer such questions thoroughly. Please let me know if you have any questions; each discussion helps me to better understand all these algorithms.

Some may notice that there are some algorithms mentioned in the first overview that I didn’t address in the article: ARIMA, RNN, CNN, Attention mechanism, etc. I am writing similar articles about unsupervised learning algorithms, computer vision techniques, NLP techniques, and deep learning algorithms. Please subscribe if you are interested!

Original Source

The post Overview of Supervised Machine Learning Algorithms appeared first on Big Data, Data Analytics, IOT, Software Testing, Blockchain, Data Lake – Submit Your Guest Post.
