Customer Segmentation Report for Arvato Financial Services

Olga Haberny
Jun 24, 2021

Capstone Project for Udacity Nanodegree Program

source: [https://www.itagroup.com/insights/keys-successful-customer-segmentation-cmb-insights]

Introduction

This post is about my journey through the Udacity Data Science course and its final capstone project, “Customer Segmentation Report for Arvato Financial Services”. In this project, I analyzed demographics data for customers of a mail-order sales company in Germany and compared it against demographics information for the general population.

The aim of this project was to create a prediction model that identifies which individuals are most likely to become customers of the company and which are not.

The first part of the project is about data cleaning. Those steps were needed before any deeper data analysis. As soon as the data was prepared, two main tasks were accomplished, namely creating:

  1. customer segmentation report
  2. supervised learning model

In this post I will go through the implementation, showing the most important parts of the project.

Data cleaning

Each row of the demographics files represents a single person, but also includes information beyond the individual, such as information about their household, building, and neighborhood. In this part, the data is cleaned (handling missing, incomplete, and misleading values) and analyzed through visualizations to get a better understanding of which algorithms and features are appropriate for solving the problem.

There are two datasets provided by the company and used in this analysis:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).

Take a look at the AZDIAS dataset:

The first thing was to analyze the NaN values in the datasets and decide what to do with them. Plotting the number of NaN values per column gave me a glimpse into the data distribution, and by looking at the figure below I decided to remove six columns with the highest percentage of NaN values (here over 65% and up to 99%).

NaN values in columns

The same analysis applies to NaN values in the rows. Of course, all steps are done for both datasets, azdias and customers; both results are shown below. In the end I kept only those rows (so those people) with more than 60% of the columns filled in (a small sketch of this filtering follows the figure below).

NaN values in rows (left — azdias dataset, right — customers dataset)
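A minimal sketch of how this thresholding could be done with pandas; the 65% column threshold and the 60% row threshold follow the text above, while the helper names and the loading step are my own:

import pandas as pd

# assuming the raw files were loaded beforehand, e.g.:
# azdias = pd.read_csv('Udacity_AZDIAS_052018.csv', sep=';')
# customers = pd.read_csv('Udacity_CUSTOMERS_052018.csv', sep=';')

def drop_sparse_columns(df, max_nan_share=0.65):
    """Drop columns whose share of NaN values exceeds max_nan_share."""
    nan_share = df.isna().mean()
    return df.drop(columns=nan_share[nan_share > max_nan_share].index)

def drop_sparse_rows(df, min_filled_share=0.60):
    """Keep only rows where more than min_filled_share of the columns are filled in."""
    return df[df.notna().mean(axis=1) > min_filled_share]

azdias = drop_sparse_rows(drop_sparse_columns(azdias))
customers = drop_sparse_rows(drop_sparse_columns(customers))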

Great, as soon as I got rid of the columns and rows with too many NaN values, I had to deal with the rest of them. I decided to fill the remaining NaN values with the mean or median of the other available numbers: for integer-like columns I took the mean, and for float columns the median.
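A small sketch of that imputation step, assuming the thinned-out azdias dataframe from above. Since pandas stores numeric columns containing NaN as floats, the sketch checks whether the remaining values are whole numbers to tell the “integer” columns apart:

# impute numeric columns: mean for integer-like columns, median for the rest
numeric_cols = azdias.select_dtypes(include='number').columns
for col in numeric_cols:
    if azdias[col].isna().any():
        non_missing = azdias[col].dropna()
        if (non_missing % 1 == 0).all():   # integer-like column -> mean
            azdias[col] = azdias[col].fillna(non_missing.mean())
        else:                              # genuinely fractional column -> median
            azdias[col] = azdias[col].fillna(non_missing.median())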

Now, I still had to manage the categorical variables; there were only 6 columns to look at. After a quick check, a few steps were needed to clean them (a sketch follows the list below):

  1. replace “X” and “XX” in the ‘CAMEO_DEU_2015’ column
  2. drop ‘EINGEFUEGT_AM’ and ‘D19_LETZTER_KAUF_BRANCHE’ (too many categories) as well as ‘CAMEO_DEUG_2015’ and ‘CAMEO_INTL_2015’ (too similar to ‘CAMEO_DEU_2015’, no extra input to the dataset)
  3. replace the categories ‘O’ and ‘W’ in ‘OST_WEST_KZ’ with binary values
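A sketch of those three steps on the azdias dataframe. The column names follow the list above; replacing the placeholder codes with NaN and mapping O/W to 0/1 are my assumptions about the exact encoding:

import numpy as np

# 1. replace the placeholder codes 'X' and 'XX' in the kept CAMEO column
azdias['CAMEO_DEU_2015'] = azdias['CAMEO_DEU_2015'].replace(['X', 'XX'], np.nan)

# 2. drop columns with too many categories or with redundant information
azdias = azdias.drop(columns=['EINGEFUEGT_AM', 'D19_LETZTER_KAUF_BRANCHE',
                              'CAMEO_DEUG_2015', 'CAMEO_INTL_2015'])

# 3. encode OST_WEST_KZ as a binary flag (O = East Germany, W = West Germany)
azdias['OST_WEST_KZ'] = azdias['OST_WEST_KZ'].map({'O': 0, 'W': 1})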

At the end, one more check was done: looking at the unique values and dropping the outliers (see below; here the LNR column was removed, since it contains a kind of ID that is different for every person).

unique values in categorical columns

After cleaning the data I ended up with datasets that consist only of int64 data types and have the following sizes:

  1. azdias.shape => (817622, 399), which is over 91% of the rows of the original dataset
  2. customers.shape => (145054, 399), which is over 75% of the rows of the original dataset

I am aware that this is probably still not enough data cleaning, but I proceeded with the next tasks to see the first results.

Segmentation

This part is dedicated to unsupervised learning techniques that describe the relationship between the demographics of the company’s existing customers and the general population of Germany. The goal here is to be able to describe which parts of the general population are more likely to be part of the customer base, and which are not.

In order to do this, I created a PCA model to see which components (combinations of the original columns/features) explain most of the variance in the dataset; in other words, which answers were crucial for seeing differences and similarities within the given datasets.

Principal Component Analysis

From the graph above it can be observed that 221 components explain 95% of the variance in the dataset. Using only those 221 components, I proceeded with k-means clustering to categorize the data.
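A sketch of how that explained-variance check could look with scikit-learn; the StandardScaler step and the variable names are my assumptions:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# scale the cleaned general-population data before PCA
scaler = StandardScaler().fit(azdias)
azdias_scaled = scaler.transform(azdias)

# fit a full PCA once to inspect the cumulative explained variance
pca_full = PCA().fit(azdias_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1   # 221 in this project

# refit with the chosen number of components and reduce both datasets
pca = PCA(n_components=n_components).fit(azdias_scaled)
azdias_pca = pca.transform(azdias_scaled)
customers_pca = pca.transform(scaler.transform(customers))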

k-means clustering

Using the elbow method I was able to choose a proper number of clusters, which in this case was 15. I was then only interested in the corner cases, i.e. the clusters with the highest and the lowest differences between the customer and general populations.
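A minimal sketch of the elbow search on the PCA-reduced data; MiniBatchKMeans is my choice to keep it tractable on roughly 800k rows, the original may well use plain KMeans:

from sklearn.cluster import MiniBatchKMeans
import matplotlib.pyplot as plt

# sum of squared distances (inertia) for a range of cluster counts
ks = range(2, 21)
inertias = [MiniBatchKMeans(n_clusters=k, random_state=42).fit(azdias_pca).inertia_ for k in ks]

# the bend ("elbow") in this curve suggested 15 clusters
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()

# final model, applied to both populations to compare cluster proportions
kmeans = MiniBatchKMeans(n_clusters=15, random_state=42).fit(azdias_pca)
azdias_clusters = kmeans.predict(azdias_pca)
customers_clusters = kmeans.predict(customers_pca)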

As a result, after reconstructing the elements from the clusters, I came up with the following interpretation:

Clusters 5 and 11 represent people who should be the focus of a marketing campaign, whereas people in clusters 8 and 10 are not necessarily good targets. By reconstructing the elements, it was possible to see which groups could potentially become customers and which not. The top features were “ONLINE_AFFINITAET” and “D19_VERSAND_ANZ_24” or “D19_GESAMT_ANZ_24”, whereas “KBA13_SEG_KOMPAKTKLASSE” and “KBA13_HERST_BMW_BENZ” were on the other side.
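To interpret the clusters, the centroids can be mapped back from PCA space into the original feature space; a hedged sketch, reusing the scaler, pca and kmeans objects from the previous steps:

import pandas as pd

def reconstruct_center(cluster_idx):
    """Express a k-means centroid in the original feature units."""
    center_scaled = pca.inverse_transform(kmeans.cluster_centers_[cluster_idx])
    center = scaler.inverse_transform(center_scaled.reshape(1, -1))[0]
    return pd.Series(center, index=azdias.columns)

# compare an over-represented cluster (e.g. 5) with an under-represented one (e.g. 8)
print(reconstruct_center(5).sort_values(ascending=False).head(10))
print(reconstruct_center(8).sort_values(ascending=False).head(10))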

Supervised Learning

Knowing which parts of the population are more likely to be customers of the mail-order company, it was time to build a prediction model. The goal here was to use the demographic information from each individual to decide whether or not it would be worth including that person in the campaign. Once again, data cleaning was needed; I applied exactly the same steps to the new datasets:

  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

I split the dataset as follows:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (34369, 400) (8593, 400) (34369, 1) (8593, 1)

After experimenting a bit with choosing the right model (I started with the standard RandomForestClassifier and GradientBoostingClassifier), I decided to use the imblearn library and its BalancedRandomForestClassifier, as the data in the given dataset is very imbalanced: less than 10% of the responses are 1, whereas the rest are 0, which leads to overfitting when using standard classifiers. Additionally, using GridSearchCV, the following parameters were chosen for the best output:

{'model_brfc__n_estimators': 100, 'model_brfc__n_jobs': 2, 'model_brfc__random_state': 42}
0.669377654523558

The last number is the ROC AUC score (Area Under the Receiver Operating Characteristic Curve), so roughly 0.67. That is for sure not the best prediction model and there is still room for further improvements. The most meaningful features taken from the created prediction model are shown below, emphasizing social background (under the names D19_SOZIALES and D19_KONSUMTYP_MAX).
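A sketch of how this model could be wired up; the pipeline step name model_brfc matches the parameter keys printed above, while the scaler step, the exact parameter grid and the cross-validation settings are my assumptions:

import pandas as pd
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model_brfc', BalancedRandomForestClassifier(random_state=42)),
])

param_grid = {
    'model_brfc__n_estimators': [100, 200],
    'model_brfc__n_jobs': [2],
    'model_brfc__random_state': [42],
}

# ROC AUC is used as the scoring metric because the classes are heavily imbalanced
grid = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train.values.ravel())
print(grid.best_params_, grid.best_score_)

# feature importances of the fitted forest (D19_SOZIALES and D19_KONSUMTYP_MAX rank high here)
forest = grid.best_estimator_.named_steps['model_brfc']
importances = pd.Series(forest.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))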

This score leads to a rather distant place in the Kaggle competition and shows that there is still much to improve.

Summary and conclusion

The final project was for sure the most challenging one of the whole course and also a great opportunity to summarize and refresh all the knowledge gathered during this time. Surprising for me (and probably not for an advanced data scientist) was to see how much time and effort I had to spend just to understand and clean the data. Still, there is for sure much more that could be done in this part. I used some basic techniques to finish the given tasks, so there is still plenty of room for improvement.

Just to sum up:

  1. The first part was dedicated to data cleaning, mostly handling NaN values and different types of variables (integer, categorical, etc.). Removing, replacing and cleaning the data was needed to proceed with further analysis.
  2. Secondly, I did the clustering part to find similarities and differences between the azdias and customers datasets. As a result, I found out that clusters 5 and 11 describe people who should be the focus of a marketing campaign, whereas clusters 8 and 10 do not.
  3. Finally, using supervised learning, I created a classifier to identify which demographic information from each individual is most crucial and to predict whether a given person could become a customer of the company.

Thanks for reading! Please check the code here to see the details of the project.
