Cross-sell Analysis



Introduction

This website serves as the supplementary material for the project. It contains information that is not included in the poster, including all the data visualizations we made and the detailed process of model training. We hope this website helps the audience draw their own conclusions.



The machine learning process in the project


Data Visualization

87.74% of customers are not interested in vehicle insurance and 12.26% are interested. If we simply predicted all clients as interested, we would expect a prediction accuracy of around 12%. Thus, this proportion is used as the benchmark for evaluating model performance.
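The class balance (and hence the benchmark) can be reproduced with a couple of lines of R; this is a minimal sketch assuming the data is loaded into a data frame called train with a 0/1 response column Response.

    # Share of not-interested (0) vs. interested (1) customers
    prop.table(table(train$Response))

    # Accuracy of the naive "predict everyone as interested" benchmark
    mean(train$Response == 1)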

Number of missing values in each feature

Distribution of age of clients

Drivers aged above 70 should be considered high-age drivers, which are highlighted in the graph. High-age drivers have significantly higher risk than other drivers, so the insurance company should pay attention to this group of clients.


The following visualizations are not closely related to our analysis. They are provided as supplementary material so that readers can draw their own conclusions from them.

Number of clients of each gender by response
Number of cars of each vehicle age by response

Number of cars with and without previous damage by response
Distribution of the number of days that customers have been associated with the company

Distribution of annual premium by age


Data Preprocessing


Data Cleaning

The dataset contains no missing values or erroneous values, so data cleaning is not necessary.
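As a minimal sketch (assuming the raw data frame is called train), the missing-value count per feature and a basic sanity check of the values can be obtained with:

    # Number of missing values in each feature
    colSums(is.na(train))

    # Quick look at value ranges and levels to spot error values
    summary(train)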


Feature Engineering

We apply ordinal encoding to the categorical features, which maps each unique label to an integer value. The details of the ordinal encoding are shown below:
    Gender: Female ~ 0, Male ~ 1
    Vehicle_Damage: No ~ 0, Yes ~ 1
    Vehicle_Age: "< 1 Year" ~ 0, "1-2 Year" ~ 1, "> 2 Years" ~ 2
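
A minimal sketch of this encoding in R, assuming the raw data frame is called train, follows the mapping listed above:

    # Ordinal encoding of the categorical features
    train$Gender         <- ifelse(train$Gender == "Male", 1, 0)
    train$Vehicle_Damage <- ifelse(train$Vehicle_Damage == "Yes", 1, 0)
    train$Vehicle_Age    <- as.integer(factor(train$Vehicle_Age,
                                              levels = c("< 1 Year", "1-2 Year", "> 2 Years"))) - 1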

The dataset after feature transformation is shown below.

Part of the data after transformation

After feature transformation, all variables contain numeric values, so we can produce a correlation heatmap, as shown below. The most notable correlations are:

- Age ~ Vehicle_Age, 0.77:
    Older people tend to have older cars, and old-car owners are more interested in vehicle insurance.
- Age ~ Policy_Sales_Channel, -0.58:
    Older customers tend to use brokers and agents, while younger customers use the internet.
- Previously_Insured ~ Vehicle_Damage, -0.82:
    Most customers whose cars were not damaged in the past already have vehicle insurance. Drivers with good driving habits should be considered as target clients.

Correlation heatmap
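
For reference, a sketch of how the correlation matrix and heatmap might be produced (assuming the encoded data frame is called train; the corrplot package is used here, though any heatmap function would do):

    library(corrplot)

    # Pearson correlations between all (now numeric) variables
    corr_mat <- cor(train)

    # Draw the correlation heatmap
    corrplot(corr_mat, method = "color", type = "lower", tl.cex = 0.8)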


Model Training

Model training process

In this project, we choose logistic regression and random forest models to predict health insurance customers' interest in vehicle insurance.


Logistic Regression

First, we use all variables to train a logistic regression model. The model summary shows that the variables Region_Code and Vintage are insignificant.

Summary of logistic regression using all variables
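
A minimal sketch of this step, assuming the encoded data has been split into train and test data frames (the test set is reused in the sketches that follow):

    # Logistic regression using all variables
    full_model <- glm(Response ~ ., data = train, family = binomial)

    # Coefficient estimates and p-values; Region_Code and Vintage appear insignificant
    summary(full_model)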

The AUC score of the model is 0.832.

ROC and AUC
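
Continuing the sketch, the ROC curve and AUC can be computed with the pROC package:

    library(pROC)

    # Predicted probabilities on the held-out test set
    pred_prob <- predict(full_model, newdata = test, type = "response")

    # ROC curve and area under the curve
    roc_obj <- roc(test$Response, pred_prob)
    auc(roc_obj)
    plot(roc_obj)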

We use the probability threshold at which sensitivity (true positive rate) and specificity (true negative rate) intersect as the best cutoff.

Best cutoff
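
One way to locate that intersection with pROC (a sketch, continuing from the ROC object above):

    # Sensitivity and specificity at every candidate threshold
    cut_df <- coords(roc_obj, "all",
                     ret = c("threshold", "sensitivity", "specificity"),
                     transpose = FALSE)

    # Cutoff where sensitivity and specificity are closest to equal
    best_cut <- cut_df$threshold[which.min(abs(cut_df$sensitivity - cut_df$specificity))]
    best_cut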

The confusion matrix of the model is shown below.

Confusion matrix of the logistic regression using all variables
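
The confusion matrix at the chosen cutoff can then be tabulated (continuing the same sketch):

    # Classify test observations using the chosen cutoff
    pred_class <- ifelse(pred_prob >= best_cut, 1, 0)

    # Rows: predicted class; columns: actual class
    table(Predicted = pred_class, Actual = test$Response)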

Then, we remove the insignificant variables Region_Code and Vintage and train another logistic regression model. All variables in the new model are significant.

Summary of logistic regression using only significant variables

The AUC score of the model is 0.832.

ROC and AUC

Again, we use the threshold at which sensitivity and specificity intersect as the best cutoff.

Best cutoff

The confusion matrix of the model is shown below. This model gives slightly better results than the previous one, which means we can achieve better results with a simpler model.

Confusion matrix of the logistic regression using only significant variables

Random Forest

We perform parameter tuning for the random forest model, fixing ntree = 200 and trying mtry from 2 to 10. The results of each parameter combination are shown in the table. The random forest with ntree = 200 and mtry = 10 has the best prediction accuracy.
Random forest tuning results
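
A sketch of the tuning loop with the randomForest package, assuming the same encoded train/test split and treating Response as a factor for classification:

    library(randomForest)

    train$Response <- as.factor(train$Response)
    test$Response  <- as.factor(test$Response)

    # Fix ntree at 200 and vary mtry from 2 to 10, recording test accuracy
    for (m in 2:10) {
      rf  <- randomForest(Response ~ ., data = train, ntree = 200, mtry = m)
      acc <- mean(predict(rf, newdata = test) == test$Response)
      cat("mtry =", m, "accuracy =", round(acc, 4), "\n")
    }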

The confusion matrix of the model is shown below.

Confusion matrix of the best random forest


Result Analysis

The process of result analysis

In this section, we evaluate the model results and choose the best model. We also make recommendations for the insurance company based on the results.

Model results
Feature importance ranking by the best model

The logistic regression using only significant variables has the highest accuracy in predicting clients' interest in vehicle insurance, so it is chosen as the best model. The feature importance ranking is calculated from the best model. The two most important features are Vehicle_Damage and Vehicle_Age.
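
As a sketch, one possible importance measure (not necessarily the one used for the figure) ranks the predictors of the reduced logistic regression by the absolute value of their z-statistics:

    # Logistic regression with only the significant variables
    reduced_model <- glm(Response ~ . - Region_Code - Vintage,
                         data = train, family = binomial)

    # Rank predictors by |z value| as a simple importance proxy
    z_vals <- summary(reduced_model)$coefficients[-1, "z value"]
    sort(abs(z_vals), decreasing = TRUE)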

Based on the machine learning results and the feature importance, we make the following recommendations to the insurance company:
    1. Use the logistic regression with only significant variables to predict clients' interest in vehicle insurance.
    2. The company's marketing strategies should focus on clients who have not damaged their cars previously and who own older cars. It is worth noting that most clients who previously damaged their cars are interested in vehicle insurance. However, insuring these drivers may encourage them to take risky actions again. Thus, to avoid moral hazard, we do not recommend that the company treat them as target clients.