Python Project — Predicting Insurance Premiums


Our simple dataset contains a few attributes for each person such as Age, Sex, BMI, Children, Smoker, Region, and their insurance charges.


The goal is to use these attributes to predict insurance charges for new customers.


Some insight:

Smokers tend to be charged more than non-smokers.

Changing binary categories to 1s and 0s.
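As a minimal sketch of this step, the snippet below maps two binary columns to 0 and 1 using plain Python dictionaries. The sample rows, the lowercase column names, and the choice of which category becomes 1 are all assumptions made for illustration; the same mapping could be applied to a pandas column instead.

```python
# Hypothetical sample rows; the keys mirror the attributes listed above.
rows = [
    {"age": 19, "sex": "female", "smoker": "yes"},
    {"age": 33, "sex": "male", "smoker": "no"},
    {"age": 45, "sex": "male", "smoker": "yes"},
]

# Assumed mappings: which category is encoded as 1 is an arbitrary choice.
BINARY_MAPS = {
    "sex": {"female": 0, "male": 1},
    "smoker": {"no": 0, "yes": 1},
}


def encode_binary(row, maps=BINARY_MAPS):
    """Return a copy of `row` with each binary category replaced by 0 or 1."""
    encoded = dict(row)  # copy so the original row is left untouched
    for column, mapping in maps.items():
        encoded[column] = mapping[row[column]]
    return encoded


encoded_rows = [encode_binary(r) for r in rows]
print(encoded_rows[0])  # {'age': 19, 'sex': 0, 'smoker': 1}
```

An unexpected category (for example a misspelled "Yes") raises a KeyError here, which is usually preferable to silently producing a wrong code.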

Score is the R2 score (the coefficient of determination). Its maximum is 100%, and it can be negative when a model predicts worse than simply using the mean of the target. It is closely related to the MSE but not the same: R2 equals one minus the MSE divided by the variance of the target.

Wikipedia defines r2 like this, “…is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).” Another definition is “(total variance explained by model) / total variance.” So if it is 100%, the model's predictions match the observed values exactly, i.e., no variance is left unexplained. A low value means the model explains little of the variance, which often signals a poor fit, although a low R2 does not by itself make a regression model invalid.
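To make the definition concrete, here is a small sketch that computes R2 by hand as 1 minus the residual sum of squares divided by the total sum of squares. The function name `r_squared` and the toy numbers are illustrative only; in practice a library routine such as scikit-learn's `r2_score` would normally be used.

```python
def r_squared(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, computed from two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]  # made-up target values

perfect = r_squared(y_true, y_true)                  # predictions match exactly
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])      # worse than the mean

print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```

The third case shows that the score is not bounded below by zero: a model that does worse than predicting the mean gets a negative R2.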

Data normalization using StandardScaler, which rescales each feature to zero mean and unit variance.
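For intuition, the transformation StandardScaler applies to each feature is z = (x - mean) / std, using the population standard deviation. The sketch below reproduces that arithmetic with the standard library only; the BMI values are made up, and `standardize` is a hypothetical helper, not part of scikit-learn.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std; a constant column would give 0 here
    return [(v - mu) / sigma for v in values]


bmi = [22.0, 27.5, 31.0, 35.5]  # made-up BMI values
z = standardize(bmi)

print([round(v, 3) for v in z])
```

In a real workflow the mean and standard deviation should be estimated on the training split only and then reused on the test split, which is what fitting a scaler on the training data and calling `transform` on the test data does.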

Linear Regression

Polynomial Regression

Decision Tree Regression

Random Forest Regression

Support Vector Regression

Using 10-Fold Cross Validation, R2 Score, and RMSE.
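A hedged sketch of this evaluation setup with scikit-learn follows. The data is synthetic (the insurance dataset is not loaded here), the shuffling and random seeds are assumptions, and LinearRegression stands in for whichever model is being evaluated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data: 200 samples, 3 features, known linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# 10 folds; shuffling with a fixed seed is an assumption for reproducibility.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

results = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=cv,
    scoring=("r2", "neg_root_mean_squared_error"),
)

# scikit-learn reports RMSE negated so that "higher is better"; flip the sign back.
r2_mean = results["test_r2"].mean()
rmse_mean = -results["test_neg_root_mean_squared_error"].mean()

print(f"mean R2 over 10 folds:   {r2_mean:.3f}")
print(f"mean RMSE over 10 folds: {rmse_mean:.3f}")
```

Swapping in another estimator only requires changing the first argument to `cross_validate`, which makes it straightforward to compare several models under the same folds.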

Evaluating Multiple Linear Regression Model

Evaluating Polynomial Regression Model

Evaluating Decision Tree Regression Model

Evaluating Random Forest Regression Model

Evaluating Support Vector Regression Model

Model Comparison


Our best model is Random Forest Regression using 400 estimators and a max_depth of 5.
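As an illustration of that configuration, the sketch below builds a random forest with 400 trees and a maximum depth of 5 in scikit-learn. Only those two settings come from the text above; the synthetic data, the `random_state`, and leaving every other hyperparameter at its default are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; the real features would be the encoded, scaled attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

# n_estimators and max_depth follow the text; random_state is an assumption.
model = RandomForestRegressor(n_estimators=400, max_depth=5, random_state=0)
model.fit(X, y)

predictions = model.predict(X[:5])
deepest_tree = max(tree.get_depth() for tree in model.estimators_)

print(f"trees in forest: {len(model.estimators_)}")
print(f"deepest tree:    {deepest_tree}")
```

Capping the depth limits how closely each tree can fit the training data, which tends to reduce overfitting at the cost of some flexibility.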

For the Jupyter notebook version:


