Blog Posts

End-to-end machine learning project: Telco customer churn

Blog: Think Data Analytics Blog

Analyzing IBM telecommunications data (Kaggle dataset)

Predicting customer churn is critical for telecommunication companies to be able to effectively retain customers. It is more costly to acquire new customers than to retain existing ones. For this reason, large telecommunications corporations are seeking to develop models to predict which customers are more likely to change and take actions accordingly.

In this article, we build a model to predict how likely a customer will churn by analyzing its characteristics: (1) demographic information, (2) account information, and (3) services information.

The objective is to obtain a data-driven solution that will allow us to reduce churn rates and, as a consequence, to increase customer satisfaction and corporation revenue.

Data set

The data set used in this article is available in the Kaggle (CC BY-NC-NDand contains nineteen columns (independent variables) that indicate the characteristics of the clients of a fictional telecommunications corporation. The Churn column (response variable) indicates whether the customer departed within the last month or not. The class No includes the clients that did not leave the company last month, while the class Yes contains the clients that decided to terminate their relations with the company. The objective of the analysis is to obtain the relation between the customer’s characteristics and the churn.


The original IBM data can be found in the following link:Telco customer churnThis sample data module tracks a fictional telco company’s customer churn based on various factors.T he churn column…www.ibm.com

The data set available in Kaggle is an adaptation of the original IBM data.

Steps of the project

The project consists of the following sections:

  1. Data Reading
  2. Exploratory Data Analysis and Data Cleaning
  3. Data Visualization
  4. Feature Importance
  5. Feature Engineering
  6. Setting a baseline
  7. Splitting the data in training and testing sets
  8. Assessing multiple algorithms
  9. Algorithm selected: Gradient Boosting
  10. Hyperparameter tuning
  11. Performance of the model
  12. Drawing conclusions — Summary

1. Data Reading

The first step of the analysis consists of reading and storing the data in a Pandas data frame using the pandas.read_csv function.

view rawread_telco.py hosted with by GitHub

Image created by the author

view rawunique_values_telco.py hosted with by GitHub

Image created by the author

As shown above, the data set contains 19 independent variables, which can be classified into 3 groups:

(1) Demographic Information

(2) Customer Account Information

(3) Services Information

2. Exploratory Data Analysis and Data Cleaning

Exploratory data analysis consists of analyzing the main characteristics of a data set usually by means of visualization methods and summary statistics. The objective is to understand the data, discover patterns and anomalies, and check assumptions before performing further evaluations.

Missing values and data types

At the beginning of EDA, we want to know as much information as possible about the data, this is when the pandas.DataFrame.info method comes in handy. This method prints a concise summary of the data frame, including the column names and their data types, the number of non-null values, and the amount of memory used by the data frame.

view rawsummary_telco.py hosted with by GitHub

Image created by the author

As shown above, the data set contains 7043 observations and 21 columns. Apparently, there are no null values on the data set; however, we observe that the column TotalCharges was wrongly detected as an object. This column represents the total amount charged to the customer and it is, therefore, a numeric variable. For further analysis, we need to transform this column into a numeric data type. To do so, we can use the pd.to_numeric function. By default, this function raises an exception when it sees non-numeric data; however, we can use the argument errors='coerce' to skip those cases and replace them with a NaN.

view rawtotal_charges_numeric.py hosted with by GitHub

We can now observe that the column TotalCharges has 11 missing values.

view rawnull_observations.py hosted with by GitHub

Image created by the author

These observations have also a tenure of 0, even though MontlyCharges is not null for these entries. This information appeared to be contradictory, and therefore, we decide to remove those observations from the data set.

view rawdrop_null_values.py hosted with by GitHub

Remove customerID column

The customerID column is useless to explain whether not the customer will churn. Therefore, we drop this column from the data set.

view rawdrop_id.py hosted with by GitHub

Payment method denominations

As shown below, some payment method denominations contain in parenthesis the word automatic. These denominations are too long to be used as tick labels in further visualizations. Therefore, we remove this clarification in parenthesis from the entries of the PaymentMethod column.

view rawunique_payment_method.py hosted with by GitHub

view rawautomatic_remove.py hosted with by GitHub

view rawunique_payment_method_after_modification.py hosted with by GitHub

3. Data Visualization

In this section, we analyze the data by using visualization.

Response Variable

The following bar plot shows the percentage of observations that correspond to each class of the response variableno and yes. As shown below, this is an imbalanced data set because both classes are not equally distributed among all observations, being no the majority class (73.42%). When modeling, this imbalance will lead to a large number of false negatives, as we will see later.

view rawtarget_visualization.py hosted with by GitHub

Image created by the author

In this article, we are going to use normalized stacked bar plots to analyze the influence of each independent categorical variable in the outcome.

normalized stacked bar plot makes each column the same height, so it is not useful for comparing total numbers; however, it is perfect for comparing how the response variable varies across all groups of an independent variable.

On the other hand, we use histograms to evaluate the influence of each independent numeric variable in the outcome. As mentioned before, the data set is imbalanced; therefore, we need to draw a probability density function of each class (density=True) to be able to compare both distributions properly.

Demographic Information

The following code creates a stacked percentage bar chart for each demographic attribute (genderSeniorCitizenPartnerDependents), showing the percentage of Churn for each category of the attribute.

view rawstacked_plot_function.py hosted with by GitHub

view rawdemographic_plot.py hosted with by GitHub

Demographic information — Image created by the author

As shown above, each bar is a category of the independent variable, and it is subdivided to show the proportion of each response class (No and Yes).

We can extract the following conclusions by analyzing demographic attributes:

Customer Account Information — Categorical variables

As we did with demographic attributes, we evaluate the percentage of Churn for each category of the customer account attributes (ContractPaperlessBillingPaymentMethod).

view rawaccount_plot.py hosted with by GitHub

Customer account information — Image created by the author

We can extract the following conclusions by analyzing customer account attributes:

Customer Account Information — Numerical variables

The following plots show the distribution of tenureMontlyChargesTotalCharges by Churn. For all numeric attributes, the distributions of both classes (No and Yes) are different which suggests that all of the attributes will be useful to determine whether or not a customer churns.

view rawhistogram_plot.py hosted with by GitHub

Customer account information — Image created by the author

We can extract the following conclusions by analyzing the histograms above:

Services Information

Lastly, we evaluate the percentage of the target for each category of the services columns with stacked bar plots.

view rawservices_plot.py hosted with by GitHub

Service information — Image created by the author

We can extract the following conclusions by evaluating services attributes:

By looking at the plots above, we can identify the most relevant attributes for detecting churn. We expect these attributes to be discriminative in our future models.

4. Feature importance

Mutual information — analysis of linear and nonlinear relationships

Mutual information measures the mutual dependency between two variables based on entropy estimations. In machine learning, we are interested in evaluating the degree of dependency between each independent variable and the response variable. Higher values of mutual information show a higher degree of dependency which indicates that the independent variable will be useful for predicting the target.

The Scikit-Learn library has implemented mutual information in the metrics package. The following code computes the mutual information score between each categorical variable of the data set and the Churn variable.

view rawmutual_information.py hosted with by GitHub

Image created by the author

Mutual information allows us not only to better understand our data but also to identify the predictor variables that are completely independent of the target. As shown above, genderPhoneService, and MultipleLines have a mutual information score really close to 0, meaning those variables do not have a strong relationship with the target. This information is in line with the conclusions we have previously drawn by visualizing the data. In the following steps, we should consider removing those variables from the data set before training as they do not provide useful information for predicting the outcome.

Feature importance (mutual information score) — Image created by the author

The mutual information extends the notion of correlation to nonlinear relationships since, unlike Pearson’s correlation coefficient, this method is able to detect not only linear relationships but also nonlinear ones.

5. Feature Engineering

Feature engineering is the process of extracting features from the data and transforming them into a format that is suitable for the machine learning model. In this project, we need to transform both numerical and categorical variables. Most machine learning algorithms require numerical values; therefore, all categorical attributes available in the dataset should be encoded into numerical labels before training the model. In addition, we need to transform numeric columns into a common scale. This will prevent that the columns with large values dominate the learning process. The techniques implemented in this project are described in more detail below. All transformations are implemented using only Pandas; however, we also provide an alternative implementation using Scikit-Learn. As you can see, there are multiple ways to solve the same problem 😃.

No modification

The SeniorCitizen column is already a binary column and should not be modified.

Label Encoding

Label encoding is used to replace categorical values with numerical values. This encoding replaces every category with a numerical label. In this project, we use label encoding with the following binary variables: (1) gender, (2) Partner, (3) Dependents, (4)PaperlessBilling, (5)PhoneService , and (6)Churn .

view rawlabel_encoding.py hosted with by GitHub

One-Hot Encoding

One-hot encoding creates a new binary column for each level of the categorical variable. The new column contains zeros and ones indicating the absence or presence of the category in the data. In this project, we apply one-hot encoding to the following categorical variables: (1) Contract, (2) PaymentMethod, (3) MultipleLines, (4) InternetServices, (5) OnlineSecurity, (6) OnlineBackup, (7) DeviceProtection, (8) TechSupport, (9) StreamingTV, and (10)StreamingMovies.

view rawone_hot_encoding.py hosted with by GitHub

The main drawback of this encoding is the significant increase in the dimensionality of the dataset (curse of dimensionality); therefore, this method should be avoided when the categorical column has a large number of unique values.

Normalization

Data Normalization is a common practice in machine learning which consists of transforming numeric columns to a common scale. In machine learning, some feature values differ from others multiple times. The features with higher values will dominate the learning process; however, it does not mean those variables are more important to predict the target. 

Data normalization transforms multiscaled data to the same scale. After normalization, all variables have a similar influence on the model, improving the stability and performance of the learning algorithm.

There are multiple normalization techniques in statistics. In this project, we will use the min-max method to rescale the numeric columns (tenureMontlyCharges, and TotalCharges) to a common scale. The min-max approach (often called normalization) rescales the feature to a fixed range of [0,1] by subtracting the minimum value of the feature and then dividing by the range.

view rawmin_max_normalization.py hosted with by GitHub

6. Setting a baseline

In machine learning, we often use a simple classifier called baseline to evaluate the performance of a model. In this classification problem, the rate of customers that did not churn (most frequent class) can be used as a baseline to evaluate the quality of the models generated. These models should outperform the baseline capabilities to be considered for future predictions.

7. Splitting the data in training and testing sets

The first step when building a model is to split the data into two groups, which are typically referred to as training and testing sets. The training set is used by the machine learning algorithm to build the model. The test set contains samples that are not part of the learning process and is used to evaluate the model’s performance. It is important to assess the quality of the model using unseen data to guarantee an objective evaluation.

Training and testing sets — Image created by the author

First, we create a variable X to store the independent attributes of the dataset. Additionally, we create a variable y to store only the target variable (Churn).

view rawX_y_split.py hosted with by GitHub

Then, we can use the train_test_split function from the sklearn.model_selection package to create both the training and testing sets.

view rawtrain_test_split.py hosted with by GitHub

8. Assessing multiple algorithms

Algorithm selection is a key challenge in any machine learning project since there is not an algorithm that is the best across all projects. Generally, we need to evaluate a set of potential candidates and select for further evaluation those that provide better performance.

In this project, we compare 6 different algorithms, all of them already implemented in Scikit-Learn.

Assessing multiple algorithms — Image created by the author

As shown below, all models outperform the dummy classifier model in terms of prediction accuracy. Therefore, we can affirm that machine learning is applicable to our problem because we observe an improvement over the baseline.

view rawcreate_models.py hosted with by GitHub

view rawtest_models.py hosted with by GitHub

It is important to bear in mind that we have trained all the algorithms using the default hyperparameters. The accuracy of many machine learning algorithms is highly sensitive to the hyperparameters chosen for training the model. A more in-depth analysis will include an evaluation of a wider range of hyperparameters (not only default values) before choosing a model (or models) for hyperparameter tuning. Nonetheless, this is out of the scope of this article. In this example, we will only further evaluate the model that presents higher accuracy using the default hyperparameters. As shown above, this corresponds to the gradient boosting model which shows an accuracy of nearly 80%.

9. Algorithm selected: Gradient Boosting

Gradient Boosting is a very popular machine learning ensemble method based on a sequential training of multiple models to make predictions. In Gradient Boosting, first, you make a model using a random sample of your original data. After fitting the model, you make predictions and compute the residuals of your model. The residuals are the difference between the actual values and the predictions of the model. Then, you train a new tree based on the residuals of the previous tree, calculating again the residuals of this new model. We repeat this process until we reach a threshold (residual close to 0), meaning there is a very low difference between the actual and predicted values. Finally, you take a sum of all model forecasts (prediction of the data and predictions of the error) to make a final prediction.

Gradient boosting classifier — Image created by the author

We can easily build a gradient boosting classifier with Scikit-Learn using the GradientBoostingClassifier class from the sklearn.ensemble module. After creating the model, we need to train it (using the .fit method) and test its performance by comparing the predictions (.predict method) with the actual class values, as you can see in the code above.

As shown in the Scikit-Learn documentation (link below), the GradientBoostingClassifier has multiple hyperparameters; some of them are listed below:

sklearn.ensemble.GradientBoostingClassifierGradient Boosting for classification. GB builds an additive model in a forward stage-wise fashion; it allows for the…scikit-learn.org

The next step consists of finding the combination of hyperparameters that leads to the best classification of our data. This process is called hyperparameter tuning.

10. Hyperparameter tuning

Thus far we have split our data into a training set for learning the parameters of the model, and a testing set for evaluating its performance. The next step in the machine learning process is to perform hyperparameter tuning. The selection of hyperparameters consists of testing the performance of the model against different combinations of hyperparameters, selecting those that perform best according to a chosen metric and a validation method.

For hyperparameter tuning, we need to split our training data again into a set for training and a set for testing the hyperparameters (often called validation set). It is a very common practice to use k-fold cross-validation for hyperparameter tuning. The training set is divided again into k equal-sized samples, 1 sample is used for testing and the remaining k-1 samples are used for training the model, repeating the process k times. Then, the k evaluation metrics (in this case the accuracy) are averaged to produce a single estimator.

It is important to stress that the validation set is used for hyperparameter selection and not for evaluating the final performance of our model, as shown in the image below.

Hyperparameter tuning with cross-validation — Image created by the author

There are multiple techniques to find the best hyperparameters for a model. The most popular methods are (1) grid search, (2) random search, and (3) bayesian optimizationGrid search test all combinations of hyperparameters and select the best performing one. It is a really time-consuming method, particularly when the number of hyperparameters and values to try are really high.

In random search, you specify a grid of hyperparameters, and random combinations are selected where each combination of hyperparameters has an equal chance of being sampled. We do not analyze all combinations of hyperparameters, but only random samples of those combinations. This approach is much more computationally efficient than trying all combinations; however, it also has some disadvantages. The main drawback of random search is that not all areas of the grid are evenly covered, especially when the number of combinations selected from the grid is low.

Grid search vs random search — Image created by the author

We can implement random search in Scikit-learn using the RandomSearchCV class from the sklearn.model_selection package.

First of all, we specify the grid of hyperparameter values using a dictionary (grid_parameters) where the keys represent the hyperparameters and the values are the set of options we want to evaluate. Then, we define the RandomizedSearchCV object for trying different random combinations from this grid. The number of hyperparameter combinations that are sampled is defined in the n_iter parameter. Naturally, increasing n_iter will lead in most cases to more accurate results, since more combinations are sampled; however, on many occasions, the improvement in performance won’t be significant.

view rawrandom_search.py hosted with by GitHub

Image created by the author

After fitting the grid object, we can obtain the best hyperparameters using best_params_attribute. As you can above, the best hyperparameters are: {‘n_estimators’: 90, ‘min_samples_split’: 3, ‘max_features’: ‘log2’, ‘max_depth’: 3}.

11. Performace of the model

The last step of the machine learning process is to check the performance of the model (best hyperparameters ) by using the confusion matrix and some evaluation metrics.

Confusion matrix

The confusion matrix, also known as the error matrix, is used to evaluate the performance of a machine learning model by examining the number of observations that are correctly and incorrectly classified. Each column of the matrix contains the predicted classes while each row represents the actual classes or vice versa. In a perfect classification, the confusion matrix will be all zeros except for the diagonal. All the elements out of the main diagonal represent misclassifications. It is important to bear in mind that the confusion matrix allows us to observe patterns of misclassification (which classes and to which extend they were incorrectly classified).

In binary classification problems, the confusion matrix is a 2-by-2 matrix composed of 4 elements:

Confusion matrix — Image created by the author

Now that the model is trained, it is time to evaluate its performance using the testing set. First, we use the previous model (gradient boosting classifier with best hyperparameters) to predict the class labels of the testing data (with the predict method). Then, we construct the confusion matrix using the confusion_matrix function from the sklearn.metrics package to check which observations were properly classified. The output is a NumPy array where the rows represent the true values and the columns the predicted classes.

view rawconfusion_matrix_telco.py hosted with by GitHub

As shown above, 1402 observations of the testing data were correctly classified by the model (1154 true negatives and 248 true positives). On the contrary, we can observe 356 misclassifications (156 false positives and 200 false negatives).

Evaluation metrics

Evaluating the quality of the model is a fundamental part of the machine learning process. The most used performance evaluation metrics are calculated based on the elements of the confusion matrix.

We can calculate the evaluation metrics manually using the numbers of the confusion matrix. Alternatively, Scikit-learn has already implemented the function classification_report that provides a summary of the key evaluation metrics. The classification report contains the precision, sensitivity, f1-score, and support (number of samples) achieved for each class.

view rawclassification_report_telco.py hosted with by GitHub

Image created by the author

As shown above, we obtain a sensitivity of 0.55 (248/(200+248)) and a specificity of 0.88 (1154/(1154+156)). The model obtained predicts more accurately customers that do not churn. This should not surprise us at all, since gradient boosting classifiers are usually biased toward the classes with more observations.

As you may have noticed, the previous summary does not contain the accuracy of the classification. However, this can be easily calculated using the function accuracy_score from the metrics module.

view rawaccuracy_model.py hosted with ❤ by GitHub

As you can observe, hyperparameter tuning has barely increased the accuracy of the model.

12. Drawing conclusions — Summary

In this post, we have walked through a complete end-to-end machine learning project using the Telco customer Churn dataset. We started by cleaning the data and analyzing it with visualization. Then, to be able to build a machine learning model, we transformed the categorical data into numeric variables (feature engineering). After transforming the data, we tried 6 different machine learning algorithms using default parameters. Finally, we tuned the hyperparameters of the Gradient Boosting Classifier (best performance model) for model optimization, obtaining an accuracy of nearly 80% (close to 6% higher than the baseline).

It is important to stress that the exact steps of a machine learning task vary by project. Although in the article we followed a linear process, machine learning projects tend to be iterative rather than linear processes, where previous steps are often revisited as we learn more about the problem we try to solve.

Original Source

The post End-to-end machine learning project: Telco customer churn appeared first on Big Data, Data Analytics, IOT, Software Testing, Blockchain, Data Lake – Submit Your Guest Post.

Leave a Comment

Get the BPI Web Feed

Using the HTML code below, you can display this Business Process Incubator page content with the current filter and sorting inside your web site for FREE.

Copy/Paste this code in your website html code:

<iframe src="https://www.businessprocessincubator.com/content/end-to-end-machine-learning-project-telco-customer-churn/?feed=html" frameborder="0" scrolling="auto" width="100%" height="700">

Customizing your BPI Web Feed

You can click on the Get the BPI Web Feed link on any of our page to create the best possible feed for your site. Here are a few tips to customize your BPI Web Feed.

Customizing the Content Filter
On any page, you can add filter criteria using the MORE FILTERS interface:

Customizing the Content Filter

Customizing the Content Sorting
Clicking on the sorting options will also change the way your BPI Web Feed will be ordered on your site:

Get the BPI Web Feed

Some integration examples

BPMN.org

XPDL.org

×