Exploratory Data Analysis on an Insurance Dataset
Before doing data analysis, it is important to prepare, clean and explore the data to better help you reach quality insights.
For our exercises on Exploratory Data Analysis, we were tasked with creating a storyline for the insurance dataset given to us.
It is always important to check the type of each column so we know how to handle the data based on its kind. As we can see from the general information returned by dataframe.info(), the dataset has 7 columns with dtypes of int64, object or float64.
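Here is a rough sketch of that first check. The rows below are made up for illustration (they are not from the actual dataset), and df is just a name I picked; only the seven column names come from the dataset described above.

```python
import pandas as pd

# Toy rows invented for illustration; NOT values from the real insurance dataset.
df = pd.DataFrame({
    "age": [19, 23, 31, 36, 44, 52, 58, 63],
    "sex": ["female", "male", "male", "female", "male", "female", "male", "female"],
    "bmi": [17.0, 22.0, 27.5, 31.0, 35.0, 42.0, 29.0, 33.5],
    "children": [0, 1, 0, 2, 3, 1, 0, 4],
    "smoker": ["no", "yes", "no", "no", "yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast",
               "southeast", "southwest", "northwest", "northeast"],
    "charges": [1800.0, 17500.0, 4200.0, 5600.0, 38000.0, 11000.0, 12500.0, 46000.0],
})

# Prints column names, non-null counts and dtypes in one table.
df.info()

# The dtypes alone; text columns typically show up as object
# (newer pandas versions may report a dedicated string dtype instead).
print(df.dtypes)
```

The non-null counts in the info() output are also a quick way to spot missing values before going further.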
By calling the describe() function, we are given a statistical summary of the numerical values, which are age, bmi, children and charges.
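As a small sketch of what describe() does on a frame with mixed column types (again with made-up toy rows, not the real data): by default it skips the text columns and summarizes only the numeric ones.

```python
import pandas as pd

# Toy rows invented for illustration; NOT values from the real insurance dataset.
df = pd.DataFrame({
    "age": [19, 23, 31, 36, 44, 52, 58, 63],
    "sex": ["female", "male", "male", "female", "male", "female", "male", "female"],
    "bmi": [17.0, 22.0, 27.5, 31.0, 35.0, 42.0, 29.0, 33.5],
    "children": [0, 1, 0, 2, 3, 1, 0, 4],
    "charges": [1800.0, 17500.0, 4200.0, 5600.0, 38000.0, 11000.0, 12500.0, 46000.0],
})

# count, mean, std, min, quartiles and max for each numeric column;
# the text column "sex" is left out automatically.
summary = df.describe()
print(summary)
```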
From these details, we can infer that the sex, region and smoker columns are categorical. Inspecting them further with value_counts() confirms this, since each one holds only a handful of distinct values.
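A minimal sketch of that inspection, assuming toy rows I made up (the category labels below, including the region names, are placeholders and not confirmed values from the dataset):

```python
import pandas as pd

# Toy rows invented for illustration; NOT values from the real insurance dataset.
df = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male", "female", "male", "female"],
    "smoker": ["no", "yes", "no", "no", "yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast",
               "southeast", "southwest", "northwest", "northeast"],
})

# A short list of repeated values per column is the sign of a categorical column.
for col in ["sex", "smoker", "region"]:
    print(df[col].value_counts())
    print()
```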
To get a visual representation of the numerical data, we can create a histogram of the columns by using hist() from matplotlib.
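One way to sketch this is with pandas' DataFrame.hist(), which draws its plots with matplotlib. I am assuming that route here (the notebook could just as well call matplotlib's hist directly), and the rows are made-up toy data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy rows invented for illustration; NOT values from the real insurance dataset.
df = pd.DataFrame({
    "age": [19, 23, 31, 36, 44, 52, 58, 63],
    "bmi": [17.0, 22.0, 27.5, 31.0, 35.0, 42.0, 29.0, 33.5],
    "children": [0, 1, 0, 2, 3, 1, 0, 4],
    "charges": [1800.0, 17500.0, 4200.0, 5600.0, 38000.0, 11000.0, 12500.0, 46000.0],
})

# One histogram per numeric column; extra keyword arguments are passed on
# to matplotlib, so color, edge color and relative bar width can be tuned here.
axes = df.hist(bins=5, figsize=(8, 6), color="steelblue",
               edgecolor="white", rwidth=0.9)
plt.tight_layout()
# In Jupyter the figure renders inline; in a plain script, call plt.show().
```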
I was able to explore the documentation of the functions/commands by pressing Shift+Tab in Jupyter! Reading the documentation of each command and its arguments really helps in understanding the code, rather than just trying out code you find on the internet. Through this, I modified the colors, bar widths, and aesthetics of the graphs created.
Let’s try this in seaborn, which is known for its pretty plotting!
Well, that really is pretty! We can see from the previous graphs that BMI has a roughly Gaussian distribution while charges is positively skewed, with most charges around the 0–15000 range.
Now, let’s see some bar graphs for the different categorical values we have by using matplotlib. We can see that there are almost the same number of female and male customers, and no significant difference in the number of customers per region. It is also clear that most of the customers are non-smokers and that only a few have more than 3 children.
Trying it out in seaborn:
For the previous 2 bar graphs, I had been using the value_counts() command to get those count vs value graphs. I never thought there was a command that could do this without value_counts(), but I learned about one during our EDA classes: seaborn’s countplot().
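To make the difference concrete, here is a small side-by-side sketch on made-up toy rows: the left plot counts the values first and then draws bars with matplotlib, while the right plot lets countplot() do the counting.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy rows invented for illustration; NOT values from the real insurance dataset.
df = pd.DataFrame({
    "smoker": ["no", "yes", "no", "no", "yes", "no", "no", "yes"],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(9, 4))

# matplotlib: count the values ourselves, then plot the counts.
counts = df["smoker"].value_counts()
ax_mpl.bar(counts.index.tolist(), counts.to_numpy())
ax_mpl.set_xlabel("smoker")
ax_mpl.set_ylabel("count")

# seaborn: countplot() does the counting internally.
sns.countplot(data=df, x="smoker", ax=ax_sns)
fig.tight_layout()
```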
Another way to visualize the distribution of BMI groups and age groups is through pie charts. Most of the customers range from obese to extremely obese, with most ages around 20–30 and 40–60.
Let’s check the correlation between values by using the corr() function.
We can see here that being a smoker correlates strongly with the charges that you pay for insurance. BMI and age show some correlation with charges, albeit much weaker: 0.198 and 0.299 respectively. Let’s try to visualize it more using scatterplots.
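A pie chart needs groups, so the continuous BMI values have to be binned first. The sketch below does that with pd.cut() on made-up toy rows. The bin edges and labels are my own assumption (commonly used BMI cutoffs) and may differ from the grouping behind the charts described here; ages can be binned the same way.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy rows invented for illustration; NOT values from the real insurance dataset.
df = pd.DataFrame({
    "bmi": [17.0, 22.0, 27.5, 31.0, 35.0, 42.0, 29.0, 33.5],
})

# Hypothetical bin edges and labels, chosen for illustration only.
bmi_bins = [0, 18.5, 25, 30, 40, float("inf")]
bmi_labels = ["underweight", "normal", "overweight", "obese", "extremely obese"]

# right=False makes each bin include its left edge, e.g. [30, 40) is "obese".
df["bmi_group"] = pd.cut(df["bmi"], bins=bmi_bins, labels=bmi_labels, right=False)

bmi_counts = df["bmi_group"].value_counts(sort=False)
bmi_counts = bmi_counts[bmi_counts > 0]  # drop empty groups so no zero-size slices

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(bmi_counts.to_numpy(),
       labels=[str(label) for label in bmi_counts.index],
       autopct="%1.0f%%")
ax.set_title("BMI groups (toy data)")
```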
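One detail worth spelling out: corr() only works on numeric columns, so a yes/no column like smoker has to be turned into numbers before it can appear in the correlation table. The sketch below assumes a simple 0/1 encoding (smoker_flag is a name I made up) and uses made-up toy rows, so its numbers will not match the 0.198 and 0.299 quoted above.

```python
import pandas as pd

# Toy rows invented for illustration; NOT values from the real insurance dataset.
df = pd.DataFrame({
    "age": [19, 23, 31, 36, 44, 52, 58, 63],
    "bmi": [17.0, 22.0, 27.5, 31.0, 35.0, 42.0, 29.0, 33.5],
    "children": [0, 1, 0, 2, 3, 1, 0, 4],
    "smoker": ["no", "yes", "no", "no", "yes", "no", "no", "yes"],
    "charges": [1800.0, 17500.0, 4200.0, 5600.0, 38000.0, 11000.0, 12500.0, 46000.0],
})

# Assumed encoding: 1 for smokers, 0 for non-smokers.
df["smoker_flag"] = (df["smoker"] == "yes").astype(int)

# Pearson correlation between every pair of numeric columns.
corr = df[["age", "bmi", "children", "smoker_flag", "charges"]].corr()

# How strongly each column moves together with charges.
print(corr["charges"].sort_values(ascending=False))
```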
I tried doing the scatterplots with smoker status taken into account, but it was really hard. The easiest way I found to do it was with seaborn.
We can confirm through these plots that you pay more if you are a smoker. However, the correlation of charges with age and BMI is not that evident.
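For reference, here is a sketch of a smoker-colored scatterplot done two ways on made-up toy rows: a plain matplotlib loop over the groups on the left, and seaborn's hue argument on the right. Both are illustrations and not necessarily the code behind the plots in this post.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy rows invented for illustration; NOT values from the real insurance dataset.
df = pd.DataFrame({
    "age": [19, 23, 31, 36, 44, 52, 58, 63],
    "smoker": ["no", "yes", "no", "no", "yes", "no", "no", "yes"],
    "charges": [1800.0, 17500.0, 4200.0, 5600.0, 38000.0, 11000.0, 12500.0, 46000.0],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# matplotlib: draw one scatter call per group so each gets its own color.
for status, group in df.groupby("smoker"):
    ax_mpl.scatter(group["age"], group["charges"], label=f"smoker: {status}")
ax_mpl.set_xlabel("age")
ax_mpl.set_ylabel("charges")
ax_mpl.legend()

# seaborn: hue splits and colors the points by smoker status in one call.
sns.scatterplot(data=df, x="age", y="charges", hue="smoker", ax=ax_sns)
fig.tight_layout()
```

The same pattern works with x="bmi" to look at charges against BMI.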
To further support our observations, I made a density plot and a boxplot for smokers and non-smokers:
Given our beginner knowledge of exploratory data analysis with pandas and matplotlib, we were able to understand the cleaned insurance dataset and create visualizations ahead of the actual data analysis.
There’s still a lot to learn and a lot more to discover, so I’ll be cutting this blog short.
For now, that’s it!
To access my EDA exercises, you may check out this GitHub repo.