Customer Data & Analytics Blog

Customer Churn: A Data Science Perspective

Iqbal Kaur | 5 minute read

Understanding customer churn is vital to the success of a business. Customer churn happens when customers who have shopped with a store for a long time stop coming or move to a competitor. Churn is normal in any business, but it’s also critical to control.

Retaining customers is important because it is much easier to retain existing customers than to bring in new ones. Customers sometimes churn out of dissatisfaction, and one dissatisfied customer has the potential to influence many others through word of mouth.

To retain a customer who is likely to churn, a business needs answers to the following two questions:

  1. Who is going to churn?
  2. When?

How do we approach these questions?

The most popular way to answer these questions is to build a predictive model that gives us the following information:

  1. What is the quantified likelihood, most often a probability score, that a customer will churn within a certain time frame?
  2. What are the factors that influence a customer to churn?
  3. How does a customer’s likelihood of churning change when one or more of these factors shift?
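
One common way to get all three pieces of information, sketched here under the assumption of a logistic model (one of the options compared later in this post), is to write the churn probability as a function of the factors. The symbols below are notation introduced for illustration: x_1 ... x_k are the factors, beta_0 ... beta_k are fitted weights, and p is the probability of churning within the chosen time frame.

```latex
p(\text{churn} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\qquad\Longleftrightarrow\qquad
\ln\frac{p}{1 - p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j
```

Under this assumption, p answers the first question, the sign and size of each beta_j speak to the second, and a one-unit increase in x_j multiplies the odds p / (1 - p) by e^{beta_j}, which speaks to the third.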

How do we build a predictive model?

The golden rule for creating models that predict the future is to gather as much information as possible from the past. What this really means is that we need to understand, very carefully, the patterns in past behavior that are highly likely to be repeated in the future.

A data scientist would break the above tasks into the following smaller questions:

What data do I need?

Here, we must gather information about past behaviors. We do this by observing a large number of customers and identifying patterns. We need information on a number of factors that lead us to believe, with confidence, that a certain customer is going to churn.

These factors generally vary by type of business, and can be framed as questions such as:

  1. Was she buying less and less over a period of time?
  2. Were there regular complaints to customer care that were not satisfactorily resolved?
  3. What type of products does she mostly buy? Staple consumables or one-time luxury items?
  4. Is the shopping behavior regular or sporadic?
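
As a rough illustration, three of the questions above can be turned into numeric factors per customer. This is a minimal sketch in plain Python, not a prescribed method: the function names, the input layout (a list of monthly spend totals, a list of purchase dates, a list of complaint outcomes), and the example customer are all hypothetical. It assumes at least two months of spend and at least three purchases.

```python
from datetime import date
from statistics import mean, pstdev


def spend_trend(monthly_spend):
    """Least-squares slope of spend per month; negative means buying less over time."""
    n = len(monthly_spend)
    x_bar = (n - 1) / 2
    y_bar = mean(monthly_spend)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(monthly_spend))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def gap_irregularity(purchase_dates):
    """Std. deviation of days between purchases divided by the mean gap.

    Higher values mean more sporadic shopping behavior.
    """
    days = sorted(purchase_dates)
    gaps = [(b - a).days for a, b in zip(days, days[1:])]
    return pstdev(gaps) / mean(gaps)


def customer_features(monthly_spend, purchase_dates, complaints):
    """complaints: list of booleans, True if the complaint was resolved."""
    return {
        "spend_trend": spend_trend(monthly_spend),
        "gap_irregularity": gap_irregularity(purchase_dates),
        "unresolved_complaints": sum(1 for resolved in complaints if not resolved),
    }


# Hypothetical customer: declining spend, widening gaps, two unresolved complaints.
example = customer_features(
    monthly_spend=[120, 100, 80, 60],
    purchase_dates=[date(2024, 1, 1), date(2024, 1, 11),
                    date(2024, 2, 10), date(2024, 4, 10)],
    complaints=[True, False, False],
)
print(example)
```

The product-type question (staples versus one-time luxury items) is left out here because it depends on how a given business categorizes its catalog.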

A lot of this information can be picked up from business databases, such as the transaction database or the call center database. Details about customer brand sentiment would also add a lot of value. This information can be gathered from social media data; although it is very difficult to gather and process, it is of immense value.

Once we have gathered a large set of such factors, we need to pick the ones that are statistically important. There are various ways of deciding this, and most statistical software packages offer measures for it. Some of the most popular ones are listed below:

  1. Variable Importance
  2. Weight of Evidence
  3. Information Value

A good discussion of the above can be found here.
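
To give a feel for two of these measures, here is a minimal sketch of Weight of Evidence (WoE) and Information Value (IV) for a single binned factor. It assumes one common convention, WoE = ln(share of non-churners in the bin / share of churners in the bin), with IV as the sum over bins of the difference in shares times the WoE; some references flip the ratio, which changes the sign of WoE but not the IV. The function name, the bin labels, and the toy data are hypothetical.

```python
import math
from collections import Counter


def woe_iv(bins, churned):
    """Return ({bin: WoE}, IV) for one binned factor.

    bins: bin label per customer; churned: 1 if the customer churned, else 0.
    Assumes every bin holds at least one churner and one non-churner.
    """
    events = Counter(b for b, y in zip(bins, churned) if y == 1)
    non_events = Counter(b for b, y in zip(bins, churned) if y == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())

    woe, iv = {}, 0.0
    for b in sorted(set(bins)):
        if events[b] == 0 or non_events[b] == 0:
            raise ValueError(f"bin {b!r} has an empty cell; merge bins first")
        share_events = events[b] / total_events
        share_non_events = non_events[b] / total_non_events
        woe[b] = math.log(share_non_events / share_events)
        iv += (share_non_events - share_events) * woe[b]
    return woe, iv


# Toy data: 10 customers with falling spend (8 churned), 10 with steady spend (2 churned).
bins = ["falling"] * 10 + ["steady"] * 10
churned = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
woe, iv = woe_iv(bins, churned)
print(woe, iv)
```

With this convention, a negative WoE marks a bin where churners are over-represented, and a larger IV suggests the factor separates churners from non-churners more strongly.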

What model to choose?

The issue here is that the dependent variable can take only two values. For example, customers who decide to leave are denoted by “1” and those who do not are denoted by “0”. For this case, there are a few possible models. Below is a brief discussion of some of these models and their pros and cons:




Logistic Regression

  Pros:
  1. Very robust modeling technique
  2. Gives a separate probability score for each customer

  Cons:
  1. Modeling process and interpretation may be difficult to understand

Decision Trees

  Pros:
  1. Very easy to interpret and can be represented visually
  2. The effectiveness does not depend on any distributional assumptions

  Cons:
  1. Does not give a separate probability score for each customer: every customer in the same leaf gets the same score, so the output is essentially a classification

Neural Network

  Pros:
  1. Very fast and accurate results
  2. Can be trained on many types of data

  Cons:
  1. The process is a black box and hence difficult to interpret and convey

Random Forest

  Pros:
  1. Can easily handle a large number of variables and different types of data
  2. Less prone to overfitting than a single decision tree

  Cons:
  1. Scores are averages over the trees rather than directly modeled probabilities, so the output is best treated as a classification
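
To make the first option in the comparison above concrete, here is a minimal sketch of logistic regression fitted by gradient descent in plain Python. The two features (a spend trend and a count of unresolved complaints), the toy data, the learning rate, and the function names are all hypothetical; in practice one would normally rely on a statistics or machine learning library rather than hand-written fitting code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def churn_probability(bias, weights, x):
    """Probability score for one customer with feature vector x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit bias and weights by batch gradient descent on the average log-loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = churn_probability(bias, weights, x) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights


# Toy data: [spend trend, unresolved complaints]; label 1 means the customer churned.
rows = [[-2.0, 2], [-1.5, 1], [-1.0, 3], [-0.5, 0], [0.0, 1],
        [0.5, 0], [1.0, 0], [1.5, 1], [-1.0, 0], [0.5, 2]]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

bias, weights = fit_logistic(rows, labels)
p_at_risk = churn_probability(bias, weights, [-2.0, 3])  # falling spend, 3 complaints
p_loyal = churn_probability(bias, weights, [1.5, 0])     # rising spend, no complaints
print(round(p_at_risk, 3), round(p_loyal, 3))
```

Each customer gets an individual score between 0 and 1, and the fitted weights show the direction in which each factor pushes that score.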



How good is my model?

It is crucial to validate your model, whichever one you choose. The most popular method is to divide the dataset into a training set and a validation set, build the model on the training set, and validate it on the validation set. Some popular diagnostics used to validate the model are listed below. These can be used for all the models mentioned above.
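
A random split of this kind can be sketched in a few lines of plain Python. The function name, the 30% validation share, and the fixed seed are illustrative assumptions, not recommendations from this post.

```python
import random


def train_validation_split(records, validation_fraction=0.3, seed=42):
    """Shuffle a copy of the records and return (training, validation)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


customers = list(range(100))  # hypothetical customer ids
training, validation = train_validation_split(customers)
print(len(training), len(validation))
```

The model is then fitted only on `training`, and the diagnostics are computed only on `validation`, so the model is judged on customers it has not seen.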

  1. Confusion Matrix: Here we create a two-by-two matrix of actual versus predicted classes, which helps us understand how many customers have been classified correctly. A large proportion of correct classifications indicates a good model.
  2. ROC curve: Here we vary the score cutoff (for example, decile by decile of predicted score) and plot the proportion of correct positive classifications against the proportion of incorrect positive classifications at each cutoff. The closer the line is to the top left, the better. This indicates that for every false positive, we get more than one true positive.
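
Both diagnostics can be computed from two lists: the actual outcomes and the model scores on the validation set. The sketch below is a plain-Python illustration with hypothetical function names and toy numbers; it also adds the area under the ROC curve (AUC), a common one-number summary that is not discussed in this post.

```python
def confusion_matrix(actual, scores, cutoff=0.5):
    """Return counts (tp, fp, fn, tn); scores >= cutoff are predicted as churn."""
    tp = fp = fn = tn = 0
    for y, s in zip(actual, scores):
        predicted = 1 if s >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_points(actual, scores):
    """(false positive rate, true positive rate) at every distinct score cutoff."""
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp, fp, fn, tn = confusion_matrix(actual, scores, cutoff)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy validation set: 1 = churned, with the model's score for each customer.
actual = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

print(confusion_matrix(actual, scores))
print(auc(roc_points(actual, scores)))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of churners above non-churners.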


Final Thoughts

The most important thing to remember here is that the ultimate conclusion of the model has to make sense for the business. Whichever variables are chosen, and however well the model classifies, there is no guarantee that the result makes business sense. The statistical models mentioned here do not say anything about exact causality. Why customers who exhibit certain behaviour in the months before they leave go on to churn can only be explained and validated by the business.

Topics: Customer Analytics, Customer Intelligence