Churn Prediction using PySpark

Predicting Customer Churn on IBM Cloud

A. Uygur Yiğit
The Startup

--

Sparkify is a fictional digital music streaming provider, and its users can cancel their subscriptions at any time. We are going to dive deep into the churn problem with a scalable machine learning approach on big data.

Background

In this project I tried to understand customer behavior and identify the customers who will cancel their subscription, whether on the free or paid tier.

I work as a Marketing Analytics Manager, and much of my time is spent analyzing customer churn behavior. After the analysis steps, we take actions targeted at potential churners.

Why Is Churn Important?

Companies make revenue from their customers; this is the basis of any business. If a company loses customers, for whatever reason, revenue decreases. Churn can be driven by customer experience problems, high prices, low quality and so on; it depends on the business area.

Zooming out, every company's customer lifecycle has two sides:

1- Acquisition

2- Churn

Companies fill their customer pool through acquisition and drain it through churn. If churn exceeds acquisition, the net customer count decreases. Moreover, acquiring new customers is generally more expensive than upgrading and satisfying existing ones. For these reasons, companies need to keep churn under control.

Churn in Sparkify

Like every company, Sparkify has a churn problem and needs to control it. According to the data provided to us, Sparkify's churn rate is around 20%.

Thanks to this project, the company can take actions targeted at potential churners. After implementing and evaluating several machine learning models, a Random Forest classifier was selected for prediction. The model is fed with customer journey data from Sparkify (average session counts, event logs, browser, OS and so on).

Sparkify has two types of customers: free tier and paid tier. For example, the company can offer one month of the paid tier free of charge to likely churners on the free tier, or offer a promotion to paid customers. These kinds of actions increase customer satisfaction and help prevent churn.

Evaluation Methods

Because the dataset is imbalanced, accuracy is not a suitable evaluation metric. With a churn rate of around 20%, simply predicting that no one will churn yields an accuracy of nearly 80%, which is clearly misleading.

Instead of accuracy, the F1 score is suitable for this project; it conveys the balance between precision and recall.

F1 Score Calculation
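In case the captioned image is hard to read, the F1 score is simply the harmonic mean of precision and recall:

    F1 = 2 * (precision * recall) / (precision + recall)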

I analyzed customer behavior as I do in my own profession. In addition, I created a machine learning model on the IBM Cloud Pak for Data platform to identify potential churners using PySpark.

Apache Spark is an open-source distributed general-purpose cluster-computing framework.

PySpark is the Python API for Spark.

IBM Cloud Pak for Data provides environments for data analysis and modelling, free of charge with some limitations.

IBM Cloud Pak has several environment options. The project code is written in Spark 3.0 with Python 3.7.

Some environment options in IBM Cloud Pak

Load and Clean Dataset

In this project, the medium_sparkify_event_data.json dataset was analyzed. You can find this dataset on my GitHub.

We mainly used pyspark.sql for manipulating the data.

The pyspark.ml library was used in the feature engineering and modeling parts.

Pandas, matplotlib and seaborn were used for analysis alongside PySpark.

httpagentparser was a useful library for extracting the OS and browser from userAgent records.

Libraries Used in the Project

Because we are working on the cloud side, some configuration is needed. Thanks to IBM for providing this chunk of code; nothing has to be configured beyond it.

Configuration and SparkSession builder
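For readers following along outside IBM Cloud Pak, a minimal sketch of the SparkSession setup looks roughly like this (the IBM-provided cell additionally wires project credentials into the Spark config, which is omitted here; the app name is arbitrary):

    from pyspark.sql import SparkSession

    # Create (or reuse) a Spark session for the notebook.
    spark = SparkSession.builder \
        .appName("Sparkify Churn Prediction") \
        .getOrCreate()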

Data Exploration

Before we begin: this is my first time using Spark DataFrames. To be honest, Pandas is much easier on the eyes. Let's look at the first 5 rows of the dataset.

First 5 rows of the Spark Dataframe
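A sketch of loading and previewing the data (the path assumes the JSON file is accessible from the notebook's working storage):

    # Load the event log and take a quick look at it.
    df = spark.read.json("medium_sparkify_event_data.json")
    df.show(5)
    df.printSchema()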

Sparkify has provided a dataset with 543k rows belonging to 449 distinct customers. This is not really big data, but it is a good prototype for analyzing and modeling big data. Along with the 543k rows there are 18 columns representing customer demographics, app events and so on.

Structure of the Dataset

Explanations of some important columns:

1- auth: user authentication status

Values of the auth column

2- level: User Plan

Values of User Plan

3- gender: User Gender

Values of Gender

4- page: pages (events) visited by the user

Type of Pages

To understand the table structure, let's look at an example user. You can see 6 columns related to the subscriber with userId 10. He/she is a paid subscriber using the app in Laurel, MS.

Events of userId: 10 (I used Pandas this time)

We are modeling customers, so we need to generate features for every single customer. We need one row per distinct customer, with columns representing that customer's behavior.

Null Values

Null values

There are two groups of null counts: 110,828 and 15,700. We need to investigate them before the analysis. It is obvious that artist, length and song follow one pattern and the demographic columns another. In addition, some userId values look empty but are not actual nulls; this is a little tricky because, for example, pandas' .isnull() function will not catch them since they are empty strings.

After some investigation, I saw that there is no userId information for rows where the auth column is "Guest" or "Logged Out", because when a customer logs off, the user information disappears in the current system. Rows without a userId are useless for modeling, since we don't know which customer they belong to, so these 15,700 rows were deleted.

When auth is “Guest” or “Logged Out”, there is no userId

Some song-related columns are null, but this time there is no need to delete those rows: we are modeling customers, so as long as the rows have userId values they can still be useful for predicting churners. I left them as they are.

The dataset represents customer behavior between 2018-10 and 2018-12, so we are modeling customers from their two-month journey. The dataset also stores timestamps in Unix time, which we will need to convert to appropriate data types later.

Data consists of 2 months of information

Data Preparation

This dataset is not clean enough for analysis and modeling as-is. First of all, we need to define the target variable, churn. If a customer's page attribute is "Cancel" or "Cancellation Confirmation", I flagged that user as a churner.

Label creation using page column
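A sketch of that labeling step, assuming the cancellation events appear in the page column as "Cancel" and "Cancellation Confirmation":

    from pyspark.sql import functions as F

    # A user who hits a cancellation page at least once is labeled as a churner.
    churn_flag = F.when(F.col("page").isin("Cancel", "Cancellation Confirmation"), 1).otherwise(0)
    user_churn = df.groupBy("userId").agg(F.max(churn_flag).alias("churn"))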

Before creating features, I defined some User Defined Functions (UDFs) for reusability.

User Defined Functions
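For illustration, two UDFs of this kind could wrap httpagentparser as below (get_os and get_browser are names I made up for the sketch, not necessarily the project's own):

    import httpagentparser
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    @udf(StringType())
    def get_os(user_agent):
        # simple_detect returns an (os, browser) tuple of strings.
        return httpagentparser.simple_detect(user_agent)[0] if user_agent else None

    @udf(StringType())
    def get_browser(user_agent):
        return httpagentparser.simple_detect(user_agent)[1] if user_agent else None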

After the labeling part, I created operating system, browser and location variables using the UDFs. We can create new columns in a Spark DataFrame using df.withColumn().

Creating new columns using UDF
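Using the UDFs sketched above, the new columns can be added like this (column names are illustrative):

    # Derive OS and browser columns from the raw userAgent string.
    df = df.withColumn("os", get_os(F.col("userAgent"))) \
           .withColumn("browser", get_browser(F.col("userAgent")))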

As I mentioned before, we need to transform the Unix timestamps into standard date-related columns. Here, I converted each user event's timestamp into time, date, month, year and year-month formats. These variables will be useful for creating other features.

Transforming Timestamps to standard format
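A sketch of that conversion; the ts column is in milliseconds, so it is divided by 1000 first (the output column names are my own):

    # Convert the Unix timestamp (ms) into date-related columns.
    df = (df.withColumn("event_time", F.from_unixtime(F.col("ts") / 1000))
            .withColumn("event_date", F.to_date("event_time"))
            .withColumn("event_month", F.month("event_date"))
            .withColumn("event_year", F.year("event_date"))
            .withColumn("event_yearmonth", F.date_format("event_date", "yyyy-MM")))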

As mentioned before, the dataset contains some empty userId values. I removed them.

Empty userId removal
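Since the missing userId values are empty strings rather than nulls, a simple filter is enough:

    # Keep only rows that actually belong to a known user.
    df = df.filter(F.col("userId") != "")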

Now we can create features for each individual. First, I created a last_date feature, because I am going to build a tenure attribute as last_date minus registration. Then I captured each customer's last charging level (paid or free). Finally, the length of the events can be useful, so I created the mean and max of length for every individual.

Last Date, Charging and length attributes creation
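Roughly, these per-user aggregates can be built like this (a sketch; the output column names are mine):

    # Per-user aggregates: last event time, registration, and length statistics.
    user_stats = df.groupBy("userId").agg(
        F.max("ts").alias("last_ts"),
        F.first("registration").alias("registration"),
        F.avg("length").alias("avg_length"),
        F.max("length").alias("max_length"),
    )

    # Tenure in days = last event minus registration (both are in milliseconds).
    user_stats = user_stats.withColumn(
        "tenure_days",
        ((F.col("last_ts") - F.col("registration")) / (1000 * 60 * 60 * 24)).cast("int"))

    # Last charging level (paid/free): take the level of the most recent event.
    from pyspark.sql import Window
    w = Window.partitionBy("userId").orderBy(F.col("ts").desc())
    last_level = (df.withColumn("rn", F.row_number().over(w))
                    .filter("rn = 1")
                    .select("userId", F.col("level").alias("last_level")))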

For example, a user can be seen on pages such as Thumbs Down, Error and Upgrade. We need to collect this information and assign it to each customer, so I created daily and monthly averages of event counts. These can be useful for modeling.

Event attributes for each subscriber
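One way to sketch the event-count features is a pivot on the page column followed by an average per user (the project's exact approach may differ):

    # Count each page type per user per day, then average over days.
    daily_page_counts = (df.groupBy("userId", "event_date")
                           .pivot("page")
                           .count()
                           .na.fill(0))
    avg_daily_pages = daily_page_counts.groupBy("userId").avg()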

In my professional experience, the user-session relation is very important for churn problems, so I created an average-items-per-session feature.

Average item in one session for each subscriber
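A sketch of the average-items-per-session feature, using the maximum itemInSession value within each session:

    # Items per session, then averaged per user.
    items_per_session = (df.groupBy("userId", "sessionId")
                           .agg(F.max("itemInSession").alias("items_in_session")))
    avg_items = items_per_session.groupBy("userId").agg(
        F.avg("items_in_session").alias("avg_items_per_session"))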

Continuing with the user-session relation, I created averages of monthly and daily session counts for each subscriber, as well as daily and monthly averages of session length. To do that, I created the grouped_session and grouped_session_length functions for reusability.

grouped_session function
grouped_session_length function
Session related columns creation
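A grouped_session-style helper could look roughly like this (the function body and output column names are my assumptions, not the project's exact code):

    def grouped_session(df, period_col, out_col):
        """Average number of distinct sessions per period (day or month) for each user."""
        per_period = (df.groupBy("userId", period_col)
                        .agg(F.countDistinct("sessionId").alias("session_count")))
        return per_period.groupBy("userId").agg(F.avg("session_count").alias(out_col))

    daily_sessions = grouped_session(df, "event_date", "avg_daily_sessions")
    monthly_sessions = grouped_session(df, "event_yearmonth", "avg_monthly_sessions")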

Before the final step, I created most-frequent OS and browser attributes for every user using a category_freq function.

Function for Finding the Most Frequent Categories for each subscriber
Most frequent os and browsers
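A category_freq-style helper can be written with a window over the per-user counts (again, a sketch rather than the project's exact implementation):

    from pyspark.sql import Window

    def category_freq(df, cat_col, out_col):
        """Most frequent value of a categorical column per user."""
        counts = df.groupBy("userId", cat_col).count()
        w = Window.partitionBy("userId").orderBy(F.col("count").desc())
        return (counts.withColumn("rn", F.row_number().over(w))
                      .filter("rn = 1")
                      .select("userId", F.col(cat_col).alias(out_col)))

    most_freq_os = category_freq(df, "os", "most_freq_os")
    most_freq_browser = category_freq(df, "browser", "most_freq_browser")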

Finally, I merged all the attributes related to the subscribers.

Merge All Attributes
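Using the per-user tables from the sketches above, the merge is a chain of joins on userId:

    # Combine all per-user features into a single modeling table.
    features = (user_stats.join(user_churn, "userId")
                          .join(last_level, "userId")
                          .join(avg_items, "userId")
                          .join(daily_sessions, "userId")
                          .join(monthly_sessions, "userId")
                          .join(most_freq_os, "userId")
                          .join(most_freq_browser, "userId"))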

Insights from processed dataset

Now we have more than 18 columns. Let's look at the whole dataset.

Dataset after feature transformation

We need to analyze the customers before modeling, so I created some visualizations about them.

Churn rate is around 20%

We need to find subscribers at risk of churning, and the data tells us that nearly 20% of the customers are churners.

Most users prefer Windows as the OS and Chrome as the browser
Some OS and browser users behave differently

I investigated the customers' operating systems and web browsers.

1- Windows is the most preferred OS and Chrome is the most preferred browser.

2- Most of the iPhone users are churners (65%), which is useful information. I also have to pass this on to the CX team :) probably the customer experience on iPhone is really bad.

3- There is no significant difference in churn rate between browsers, but Firefox is slightly higher than the others.

No significant difference in Gender

Looking at the gender distribution, the number of males is slightly higher than females, but the churn rate of female customers is very close to that of male customers.

Some locations are important

Location can be a good feature for churn prediction because churn rates differ hugely between locations.

Free and paid customers have equal churn rate

It is interesting that paid and free customers have nearly the same churn rate.

At the start, the dataset had 18 columns, one of which was the userId. After the data cleaning steps, I created features such as:

· Tenure Days

· Daily and monthly averages of page event counts

· Average number of items a user logs in one session

· Most Frequent Os

· Most Frequent Browser

Feature Engineering

All feature engineering steps in my project are done by a single function I wrote. This function:

  1. Takes the numeric variables, except the target variable churn
  2. Takes the categorical variables and transforms them using StringIndexer
  3. Encodes the StringIndexer outputs using OneHotEncoder
  4. Assembles all numeric and encoded variables using VectorAssembler
  5. Puts all of these steps into one pipeline
  6. Transforms all variables using the pipeline and creates the modeling dataset, which has a features vector and a target column (see the pipeline sketch after the captions below).
Function for Feature Engineering
Output of the function
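A sketch of such a pipeline, with illustrative column lists drawn from the features above (the project's actual function may differ in detail):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    categorical_cols = ["last_level", "most_freq_os", "most_freq_browser"]
    numeric_cols = ["tenure_days", "avg_length", "max_length",
                    "avg_items_per_session", "avg_daily_sessions", "avg_monthly_sessions"]

    # Index each categorical column, then one-hot encode the indices.
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
                for c in categorical_cols]
    encoder = OneHotEncoder(inputCols=[c + "_idx" for c in categorical_cols],
                            outputCols=[c + "_ohe" for c in categorical_cols])

    # Assemble numeric and encoded columns into a single features vector.
    assembler = VectorAssembler(
        inputCols=numeric_cols + [c + "_ohe" for c in categorical_cols],
        outputCol="features")

    pipeline = Pipeline(stages=indexers + [encoder, assembler])
    model_df = (pipeline.fit(features).transform(features)
                        .withColumnRenamed("churn", "label")
                        .select("features", "label"))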

Now our dataset is ready for modeling. Let's move on to the modeling step :)

Modeling

I split the data into an 80% training sample and a 20% validation sample. After splitting, I tried RandomForestClassifier, GBTClassifier (gradient-boosted trees) and LogisticRegression separately.

Model and Evaluator Creation
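A sketch of the split, the three candidate models and the F1 evaluator (default hyperparameters here; PySpark's gradient boosting implementation is GBTClassifier):

    from pyspark.ml.classification import RandomForestClassifier, GBTClassifier, LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # 80% training, 20% validation.
    train, test = model_df.randomSplit([0.8, 0.2], seed=42)

    models = {
        "RandomForest": RandomForestClassifier(labelCol="label", featuresCol="features"),
        "GBT": GBTClassifier(labelCol="label", featuresCol="features"),
        "LogisticRegression": LogisticRegression(labelCol="label", featuresCol="features"),
    }
    f1_evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                                     predictionCol="prediction",
                                                     metricName="f1")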

After creating the models, I fitted them to the training data in one for loop and calculated the prediction results on the test set.

Fit, Predict and Evaluate
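The loop itself can be as simple as this sketch:

    import time

    # Fit each model, score the validation set and report F1 plus training time.
    for name, clf in models.items():
        start = time.time()
        fitted = clf.fit(train)
        predictions = fitted.transform(test)
        f1 = f1_evaluator.evaluate(predictions)
        print(f"{name}: F1 = {f1:.3f}, train time = {time.time() - start:.1f}s")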

The evaluation part takes quite a lot of time, but for now that is okay for me.

First Output of the Code

Comparing the 3 models on each metric:

- Train Time: Random Forest is the Best
- F1 Score: Random Forest is the Best
- Accuracy Score: Random Forest is the Best
- Precision: Logistic Regression is slightly higher than the Random Forest
- Recall: Random Forest is the Best

Looking at the F1 score, Random Forest beats the others. Its recall is also 1, which means the model caught all churners in the validation set.

Results of RF, GBT and LR

I selected Random Forest for parameter tuning. I believe it will perform even better :)

Random Forest Tuning

In the tuning part, I used a function I wrote for tuning the Random Forest parameters.

The rf_tuning function takes the train and test sets plus num_trees and max_depth parameters as arguments, and:

  1. Creates a RandomForestClassifier
  2. Creates a parameter grid using ParamGridBuilder
  3. Creates a 3-fold CrossValidator around the created RF model
  4. Fits the data, predicts and evaluates the results using the F1 score (a sketch follows the caption below).
Parameter Tuner Function for RF
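A sketch of what an rf_tuning-style function can look like (the parameter values passed at the end are examples, not the project's exact grid):

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    def rf_tuning(train, test, num_trees, max_depth):
        rf = RandomForestClassifier(labelCol="label", featuresCol="features")
        grid = (ParamGridBuilder()
                .addGrid(rf.numTrees, num_trees)
                .addGrid(rf.maxDepth, max_depth)
                .build())
        evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")
        cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                            evaluator=evaluator, numFolds=3)
        cv_model = cv.fit(train)
        f1 = evaluator.evaluate(cv_model.transform(test))
        return cv_model.bestModel, f1

    best_rf, best_f1 = rf_tuning(train, test, num_trees=[20, 50], max_depth=[5, 10])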

Tuned Model Evaluation

A Random Forest basically consists of decision trees; it is an ensemble model. Two of its key hyperparameters are max depth and the number of trees.

Max Depth

Max depth represents the longest path between the root node and a leaf node. As max depth increases, the model gets closer to overfitting.

Num Trees

As mentioned before, a Random Forest consists of trees. When we select num_trees: 20, the model consists of 20 trees. Increasing the number of trees generally makes the results more accurate, but of course there is a trade-off: computational cost!

As a result, I found that the best parameters are max_depth: 10 and num_trees: 20.

The F1 score increased from 0.80 to 0.82. Our recall decreased slightly, but that is okay for now: the model can still catch 99 out of 100 churners. Also, if the model predicts a user as a churner, there is roughly an 85% probability that the user will actually churn.

Results After Implementation of the Tuning Function
Best Parameters

Feature Importance

Top 16 Features

As you can see in the chart above, the most important feature is tenure_days. Most of the event attributes carry significant weight in the model, and the daily session count attribute is also useful for our churn prediction problem.
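For reference, the importances can be pulled out of the fitted model like this (a sketch; one-hot encoded categoricals expand into several vector slots, so mapping values back to feature names takes a bit of care):

    import pandas as pd

    # featureImportances is a Spark vector aligned with the VectorAssembler inputs.
    fi = pd.Series(best_rf.featureImportances.toArray())
    print(fi.sort_values(ascending=False).head(16))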

Conclusion

As a result, we created a churn model with a 0.82 F1 score. We know the churn rate is around 20% (let's assume 30% for this back-of-the-envelope calculation). If we took churn actions on every customer:

  • precision will be nearly 0.3 (TP/(TP+FP))
  • recall will be 1 (TP/(TP+FN))
  • F1 score will be nearly 0.46 ~ 0.5 (checked in the sketch below)
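A quick sanity check of that baseline F1 using the formula from the evaluation section:

    # Baseline: treat every customer as a churner (precision ~= churn rate, recall = 1).
    precision, recall = 0.3, 1.0
    f1 = 2 * precision * recall / (precision + recall)
    print(round(f1, 2))  # ~0.46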

If we take these scores as a baseline, our final model's F1 score of 0.82 is clearly better. Among the candidates, the Random Forest classifier beats the gradient-boosted trees and logistic regression for our problem.

Reflection

This is my first project in a cloud environment. Learning the syntax and some of the operations on the dataset was hard for me, because I feel more confident and safe in Pandas, but Pandas is not enough for analyzing huge datasets.

On the other hand, churn is my favorite problem because, as I mentioned before, I work on churner subscribers every day in my profession.

Improvement

We could perhaps improve the results in different ways:

  • Trying different types of ML models, such as XGBoost and LightGBM, for more precise results.
  • Using ensembling approaches that train several classifiers and combine their predictions.
  • Tuning more parameters of the Random Forest model.

GitHub Repo

I hope you find this project useful. For further details, you can check my GitHub page.
