How to predict Churn using Machine Learning

Ankit Bagga
5 min read · Aug 3, 2021


Churn is a complicated business concept and its definition might vary from one business type to another. But a broadly agreed-upon definition of churn is the number of subscribers/customers that leave a provider/business. If you have a subscription-based business, churn might be very explicit: your subscribers cancel their subscription (thus making them churn). But if you run an e-commerce business, you might consider a customer churned when they have not transacted with your business within a given period of time (often a year).

Predicting churn using Machine learning is a classification problem and we will be using supervised machine learning models to try and solve it. Imagine Churn to be a dummy variable (or binary variable) that can take in values either 0 (indicating no-churn) or 1 (indicating churn).

As far as the independent variables are concerned, we have gender, senior citizen, partner, internet service, etc.

The model that we will be using is KNN (K nearest neighbors) classifier. It does what it says it does. It assumes similar things exist in close proximity and assigns classes accordingly.

In this model, we try different values of K (the number of nearest neighbors) to classify an unknown record into a given class. If K = 1, we look only at the single nearest neighbor to decide which class a new record belongs to.
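The idea can be sketched in a few lines with scikit-learn. The data here is hypothetical (two made-up features per customer), just to show that with K = 1 the prediction is simply the class of the single closest training point:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two features per customer, 0 = no-churn, 1 = churn
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

# With K = 1, a new point takes the class of its single nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
print(knn.predict([[2, 2], [9, 9]]))  # -> [0 1]
```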

Before we start the modeling process, there are some basic pre-processing steps that you should follow. The first one is dealing with “NA” (not-available) values and filling them with 0. Remember you can also choose to remove these rows from your dataset, or fill the gaps with some other statistic, like the mean of the remaining values.
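In pandas, all three options are one-liners. The column names below are assumptions based on the typical churn dataset (tenure, TotalCharges); swap in your own:

```python
import pandas as pd
import numpy as np

# Hypothetical slice of a churn dataset with a missing TotalCharges value
df = pd.DataFrame({"tenure": [1, 34, 2],
                   "TotalCharges": [29.85, np.nan, 108.15]})

filled_zero = df.fillna(0)                                          # fill NAs with 0
filled_mean = df.fillna({"TotalCharges": df["TotalCharges"].mean()})  # fill with the mean
dropped = df.dropna()                                               # or drop the rows
```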

We can also drop columns that won’t be of any use to the model, like customer ID. We then convert all the categorical variables to dummy variables so that our model can work with them. Essentially, a variable like Gender with the categories male and female becomes a single variable called gender_male, which is 0 for female and 1 for male.
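Both steps map directly onto pandas calls; `drop_first=True` is what collapses a two-category variable like Gender into the single gender_Male column (the column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"customerID": ["0001-A", "0002-B"],
                   "gender": ["Male", "Female"],
                   "tenure": [12, 3]})

df = df.drop(columns=["customerID"])      # the ID carries no predictive signal
df = pd.get_dummies(df, drop_first=True)  # gender -> gender_Male (1 = male, 0 = female)
print(df.columns.tolist())
```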

Post this for continuous variables, it’s always a good practice to normalize and scale the data using Standard Scaler. This will make sure that bigger values like Total charges paid by a customer don’t overpower small values of tenure of a customer in years.
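A minimal sketch of the scaling step, with made-up numbers standing in for total charges (large scale) and tenure in years (small scale):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: total charges (large values), tenure in years (small values)
X = np.array([[2000.0, 1.0],
              [4000.0, 3.0],
              [6000.0, 5.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and std 1
```

After scaling, both columns contribute on the same footing, which matters for a distance-based model like KNN.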

Once we have done the pre-processing of the data, the next step is to split the data into test and training sets.
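With scikit-learn the split is a single call; stratifying on the label (an optional extra, not mentioned above) keeps the churn ratio the same in both splits:

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 100 records, half churned
X = [[i] for i in range(100)]
y = [0] * 50 + [1] * 50

# Hold out 20% as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 80 20
```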

And finally, it’s time to implement the model, by importing it from sklearn.neighbors. In KNeighborsClassifier, one of the most important parameters is the number of neighbors (k). Using too few neighbors increases variance and can lead to overfitting (with k = 1 the model effectively memorizes the training set), while using too many neighbors increases bias and can lead to an under-fitted model.

Thus we use GridSearchCV, which takes in the model, a parameter grid (here a range of values for n_neighbors), and the number of cross-validation folds (here 10).
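A sketch of the search, using synthetic data from make_classification as a stand-in for the preprocessed churn features (that substitution is my assumption; the original runs on the churn dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the preprocessed, scaled churn features
X, y = make_classification(n_samples=200, random_state=0)

param_grid = {"n_neighbors": list(range(1, 100))}  # k = 1 .. 99
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 2))
```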

In 10-fold cross-validation, as used in this example, the data is divided into 10 subsets. Each of the 10 subsets takes a turn as the validation set while the remaining 9 are used to train the model; the process is repeated 10 times and the results are averaged. This reduces the chance that a single unlucky split misleads us, because every observation is used for both training and validation, giving a less biased and less noisy estimate of performance than a single hold-out split.

Here we are using a range of n_neighbors from 1 to 99, meaning that the model is run with 10-fold cross-validation for each of the 99 possible values of n_neighbors.

The search is then fitted on the training set, and the best score and parameters are reported to see how the model performs there. Finally, the tuned model is evaluated on the test set to see how it would perform on an unseen dataset.
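The fit-then-evaluate flow looks like this (again on synthetic stand-in data; note that `grid.score` on the test set evaluates the refitted best model, it does not re-run the search):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the preprocessed churn data
X, y = make_classification(n_samples=300, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": list(range(1, 30))}, cv=10)
grid.fit(X_train, y_train)                 # search runs on the training set only

print(grid.best_params_, grid.best_score_)  # best k and its mean CV accuracy
print(grid.score(X_test, y_test))           # performance on unseen data
```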

In our case, the best value found for n_neighbors is 23, with a best cross-validation score of 0.78.

If you plot the mean scores of the 10-fold cross-validation against each n_neighbors value, the scores initially increase as we increase n_neighbors, but drop again as we increase it further, validating our choice of parameter for n_neighbors.
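The per-k mean scores are available in `cv_results_` after the search, so the plot is a few extra lines (synthetic stand-in data again; the non-interactive Agg backend is just so this runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
ks = list(range(1, 50))
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": ks}, cv=10)
grid.fit(X, y)

scores = grid.cv_results_["mean_test_score"]  # one mean CV score per k
plt.plot(ks, scores)
plt.xlabel("n_neighbors")
plt.ylabel("mean 10-fold CV accuracy")
plt.savefig("knn_scores.png")
```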

The entire codebase is published on GitHub at https://github.com/AnkitBagga31/KNNclassifer along with the dataset. Feel free to download it, run the Python file in a Jupyter notebook, and tweak the parameters to make the model more accurate. Additionally, you can visit https://ankitbagga.com/ to learn more about the machine learning models you can use to upgrade your marketing strategies.



Written by Ankit Bagga

Growth marketer | Product Owner | Entrepreneur.
