
Preparing data for machine learning algorithms

Blog: Think Data Analytics Blog

Description of the stack and some introductory notes

In this article, we will use the Python programming language with its accompanying libraries (sklearn, matplotlib, seaborn) and Jupyter Notebook as the environment for running the code.

The purpose of this post is to show general approaches to data preparation. That is, those manipulations that need to be performed before loading data into a machine learning model. In an ideal world, you would have a completely clean dataset with no outliers or missing values. However, in the real world, such datasets are extremely rare.
Next, we will consider data from Kaggle: "Mental Health in Tech Survey".

First look at the dataset and understanding its specifics

It’s hard to work with data without understanding what it is, so let’s load it up and display some statistics.

import pandas as pd
import numpy as np

df = pd.read_csv("survey.csv")
df.head()

This gives us a first impression of the data. Next, let's look at the dimensions of our tabular data by executing the code below line by line:

df.shape      # dimensions of the dataframe (rows, columns)
df.info()     # index description, column dtypes, count of non-null values
df.describe() # statistics: count, mean, std, min, 25%/50%/75% percentiles, max
df.nunique()  # number of unique values for each column

It would also be nice to see information about the count of each unique value for each column in the dataset:

feature_names = df.columns.tolist()
for column in feature_names:
    print(column)
    print(df[column].value_counts(dropna=False))

Most of the columns look good, but a few need cleaning up, most notably the free-text spellings in Gender and the impossible values in Age, both of which are handled below.

Separating features and the target variable

Since we are framing this as a supervised learning problem (one we have posed ourselves), we need to split the data into features used for training and a target variable to predict.

The choice of target variable depends on your goals. For example, based on this dataset you could solve a classification problem (determine the gender of the respondent) or a regression problem (predict the age of the respondent). In what follows, we take a classification problem: predicting whether the respondent will seek treatment.

features = df.drop('treatment', axis=1)
labels = df['treatment']

Handling data gaps

There is rarely a one-size-fits-all approach to this task, since the right choice depends heavily on the context and nature of the data.

For example, are the gaps random, or is there a hidden link between the gaps and some other value in the same record?
One simple way to handle this is to ignore or delete the rows that have missing data, throwing them out of our analysis. However, this method can be bad because of the loss of information.
Another way is to fill in the gaps, replacing each missing value somehow. Basic implementations simply replace all missing values with the mean, the median, or a constant.
First, let's figure out what to do with the missing values found in self_employed and work_interfere. In both cases, the column contains categorical data.
Consider a small example dataset with three attributes (weather, temperature, and humidity) used to predict whether I will play tennis, where several rows contain missing values.

If we removed all rows with missing values, only one row would be left, and our predictor would always assume that I should play tennis, since there would be no other outcomes for it to learn from. Suppose instead we replace the missing temperature value in row 3 with the average. The temperature in row 3 would then be artificially reported as 65, which would allow the algorithm to produce a negative answer for some inputs.
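To make the trade-off concrete, here is a minimal sketch with a hypothetical toy dataset in the spirit of the tennis example (all values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy dataset: one target column and missing numeric values.
toy = pd.DataFrame({
    "weather":     ["sunny", "rainy", "sunny", "overcast"],
    "temperature": [85.0, None, None, 60.0],
    "humidity":    [0.85, 0.90, None, 0.70],
    "play":        ["no", "no", "yes", "yes"],
})

# Option 1: drop rows with any missing value -- we lose half the data here.
dropped = toy.dropna()
print(len(dropped))  # 2

# Option 2: fill missing numeric values with the column mean instead.
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled["temperature"].tolist())  # [85.0, 72.5, 72.5, 60.0]
```

Both options keep the pipeline running; which one is appropriate depends on how much data you can afford to lose and how plausible the filled-in values are.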

Scikit-learn provides an implementation for handling gaps:

# Imputer was renamed SimpleImputer in modern scikit-learn
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(features)
features = imputer.transform(features)

Finding Implicit Duplicates

As mentioned earlier, there are 49 distinct values for Gender, and it was suspected that some of these values should not be considered separate categories. Ultimately, for simplicity, we will divide the data into three categories: male, female, and other (the last includes values that cannot be unambiguously assigned to the first two, for example, transgender).

If you wanted to build a preprocessing mechanism that could clean up arbitrary incoming data, you would need a smarter approach. But since our task is to work with an already existing dataset, we simply replace the known variants.

male_terms = ["male", "m", "mal", "msle", "malr", "mail", "make", "cis male",
              "man", "maile", "male (cis)", "cis man"]
female_terms = ["female", "f", "woman", "femake", "femaile", "cis female",
                "cis-female / femme", "female (cis)", "femail", "cis woman"]

def clean_gender(response):
    if response.lower().rstrip() in male_terms:
        return "Male"
    elif response.lower().rstrip() in female_terms:
        return "Female"
    else:
        return "Other"

df['Gender'] = df['Gender'].apply(clean_gender)

Outlier detection

As mentioned earlier, there are values of Age that appear erroneous. Negative ages or extremely large integers can distort the results of a machine learning algorithm, so we need to eliminate them.
For this, we take a heuristic estimate of the range of working ages: from 14 to 100 years old. All values outside this range are converted to NaN (not-a-number).

df.loc[(df.Age < 14) | (df.Age > 100), 'Age'] = np.nan

These NaN values can then be processed using the sklearn imputer described above.
After determining the valid range, we visualize the distribution of ages present in the dataset.

%matplotlib inline
import seaborn as sns

sns.set(color_codes=True)
# distplot is deprecated in recent seaborn; histplot is the modern equivalent
plot = sns.histplot(df.Age.dropna(), kde=True)
plot.figure.set_size_inches(6, 6)

Data encoding

Many machine learning algorithms expect numeric inputs, so we need to figure out a way to represent our categorical data numerically.

One solution would be to randomly assign a numeric value to each category and map the dataset from the original categories to the corresponding numbers. For example, let's look at the "leave" column (how easy is it for you to take sick leave for a mental health condition?) in our dataset:

df['leave'].value_counts(dropna=False)

This returns the following values:

Don't know            563
Somewhat easy         266
Very easy             206
Somewhat difficult    126
Very difficult         98
Name: leave, dtype: int64

To encode this data, we map each value to a number.

df['leave'] = df['leave'].map({'Very difficult': 0,
                               'Somewhat difficult': 1,
                               "Don't know": 2,
                               'Somewhat easy': 3,
                               'Very easy': 4})

This process is known as label encoding, and sklearn can do it for us:

from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(df['leave'])
label_encoder.transform(df['leave'])

The problem with this approach is that you are introducing an order that might not be present in the original data. 

In our case, it can be argued that the data is ranked ("Very difficult" is less than "Somewhat difficult", which is less than "Somewhat easy", which is less than "Very easy"), but most categorical data has no inherent order. For example, for a feature denoting the type of animal, the statement "cat is greater than dog" is meaningless. The danger of label encoding is that your algorithm may learn to favor dogs over cats because of the artificial ordinal values you introduced during encoding.

A common solution for encoding nominal data is one-hot encoding.

Instead of replacing each categorical value with a number (label encoding), we create a separate column for each value and use 1 and 0 to denote its presence. These new columns are often referred to as dummy variables.

You can do one-hot encoding directly in Pandas or use sklearn. Note that older versions of sklearn's OneHotEncoder only worked on integer values, so string inputs like ours had to be label-encoded first; recent versions can encode string categories directly.

# Using Pandas
import pandas as pd
pd.get_dummies(features['leave'])

# Using sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_encoder = LabelEncoder()
ohe = OneHotEncoder()
label_encoded_data = label_encoder.fit_transform(features['leave'])
ohe.fit_transform(label_encoded_data.reshape(-1, 1))

Normalizing training data

At this point, we have successfully cleaned up our data and turned it into a form that is suitable for machine learning algorithms. However, at this stage, we must consider whether any method of data normalization is useful for our algorithm. It depends on the data and the algorithm we plan to implement.

ML algorithms that typically require data normalization: k-nearest neighbors, support vector machines, neural networks, regularized linear and logistic regression, k-means, PCA.

ML algorithms that do not require data normalization: decision trees, random forests, gradient boosting, naive Bayes.

Note: The above lists are by no means exhaustive, but merely serve as an example.

Suppose you have a dataset with features in different units: temperature in Kelvin, relative humidity, and day of the year. Each of these features spans a very different numeric range.

When you look at these values, you intuitively normalize them. For example, you know that an increase of 0.5 (= 50%) in humidity is much more significant than an increase of 0.5 in temperature. If we don't normalize the data, our algorithm may learn to use temperature as the main predictor simply because its scale is the largest (and therefore its changes look the most significant to the algorithm). Normalization allows all features to contribute equally (or, more accurately, allows features to be weighted by their importance rather than their scale).
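A quick sketch of this effect, using invented temperature and humidity values and an assumed observed range for each feature:

```python
import numpy as np

# Two hypothetical samples: temperature in Kelvin and relative humidity (0 to 1).
a = np.array([290.0, 0.30])   # [temperature, humidity]
b = np.array([291.0, 0.90])

# Unnormalized Euclidean distance is dominated by the temperature axis.
raw_dist = np.linalg.norm(a - b)
print(round(raw_dist, 3))  # 1.166 -- the 0.6 humidity gap barely registers

def min_max(x, lo, hi):
    """Scale x to [0, 1] given an assumed observed range [lo, hi]."""
    return (x - lo) / (hi - lo)

# Assumed observed ranges for each feature (illustrative only).
a_s = np.array([min_max(a[0], 280.0, 310.0), min_max(a[1], 0.0, 1.0)])
b_s = np.array([min_max(b[0], 280.0, 310.0), min_max(b[1], 0.0, 1.0)])
scaled_dist = np.linalg.norm(a_s - b_s)
print(round(scaled_dist, 3))  # 0.601 -- now driven mostly by humidity
```

After scaling, the large humidity difference dominates the distance, which matches our intuition about which change matters more.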

Normalization Algorithm

If you use a tool like gradient descent to optimize your algorithm, data normalization allows you to consistently update the weights in all dimensions.

With two features on different scales, the loss surface forms elongated contours, and gradient descent may take much longer and ultimately fail to reach the minimum; in a normalized feature space the contours are closer to circular and optimization converges more directly.

There are several different methods for normalizing data, the most popular are:

Min-max normalization maps the lowest observed value to 0 and the highest observed value to 1: x' = (x - min) / (max - min).

Standardization normalizes to zero mean and unit standard deviation: z = (x - mean) / std.
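Both formulas are easy to compute by hand with numpy; here is a small sketch on invented ages:

```python
import numpy as np

ages = np.array([20.0, 30.0, 40.0, 50.0, 60.0])

# Min-max normalization: lowest value -> 0, highest -> 1.
mm = (ages - ages.min()) / (ages.max() - ages.min())
print(mm.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]

# Standardization: subtract the mean, divide by the standard deviation.
z = (ages - ages.mean()) / ages.std()
print(round(z.mean(), 10), round(z.std(), 10))  # 0.0 1.0
```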

We can use sklearn functions to perform normalization:

# Feature scaling with StandardScaler
from sklearn.preprocessing import StandardScaler

scale_features_std = StandardScaler()
features_train = scale_features_std.fit_transform(features_train)
features_test = scale_features_std.transform(features_test)

# Feature scaling with MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

scale_features_mm = MinMaxScaler()
features_train = scale_features_mm.fit_transform(features_train)
features_test = scale_features_mm.transform(features_test)

A few notes on this implementation: in practice you may want to scale only certain columns. For example, there is no need to normalize the dummy variables produced by one-hot encoding.
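One way to scale only selected columns is sklearn's ColumnTransformer; here is a sketch on a small hypothetical frame with one numeric column and two dummy columns (the column names are invented):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame: one numeric column plus dummy variables that
# should pass through unscaled.
X = pd.DataFrame({
    "Age":        [25.0, 40.0, 55.0],
    "leave_easy": [1, 0, 0],
    "leave_hard": [0, 1, 1],
})

ct = ColumnTransformer(
    [("scale", StandardScaler(), ["Age"])],  # scale only the Age column
    remainder="passthrough",                 # leave dummy columns untouched
)
X_t = ct.fit_transform(X)
print(X_t[:, 1:].tolist())  # dummy columns preserved as-is
```

The transformed output places the scaled columns first, followed by the passthrough columns in their original order.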

Separating data for training and testing

Splitting data into two subsamples

One of the last steps in preparing the data for training is to separate it into training and test sets. Holding out a test sample is necessary to verify that the algorithm has been trained properly (that there was no overfitting or underfitting).

from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

Splitting data into three subsamples

You can go further and divide the data into three subsets: training, validation, and holdout. The training data is used to fit the model, the validation data is used to choose the best model architecture, and the holdout set is reserved for the final evaluation of the model.

When building a model, we often face choices about its overall design; the validation set lets us evaluate multiple designs in search of the best one, but in doing so we also "tailor" the model to that subset. The holdout data therefore remains useful for determining how well the model generalizes to genuinely new data.
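A three-way split can be sketched with two calls to train_test_split; the proportions below (60/20/20) are just an example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples with 2 features each.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First carve off the holdout set (20%), then split the remainder
# into training (60% overall) and validation (20% overall).
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_hold))  # 30 10 10
```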

