Exploratory Data Analysis with Python
In this article, I am going to analyze a data set from Kaggle. Kaggle is a platform for predictive modeling and analytics competitions in which companies and researchers post data and statisticians and data miners compete to produce the best models for predicting and describing the data.
This is my first ever EDA. I have tried to analyze the data as thoroughly as I can, but there might still be some missing or incorrect points.
The competition that I am going to analyze is about predicting house prices. You can find the competition by clicking here.
So, what is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the first step in your data analysis process. Here, you make sense of the data you have and then figure out what questions you want to ask and how to frame them, as well as how best to manipulate your available data sources to get the answers you need.
In this example, and for all future examples, I am going to use the Python programming language.
Challenge Description
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Let’s get to work!
When we open the link above, we can see that there are some data files. For this analysis, we only need the data_description.txt and train.csv files. The data_description.txt file contains a description of each attribute given in train.csv. We will load train.csv, and whenever we need the description of an attribute, we will look at data_description.txt.
First, let’s see which libraries we are going to use in this project:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Let’s load the data set from train.csv:
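A minimal sketch, assuming train.csv sits in the working directory:
data_train = pd.read_csv('train.csv')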
Let’s have a quick look at our data:
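We can print a concise summary with pandas’ info() method:
data_train.info()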
Example output, showing the first 7 columns:
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
In the output we can see that there are a total of 1460 entries and 81 columns. The last column is our class variable, SalePrice, and the first 80 columns are the features that will shape our prediction. A full description of each column is given in data_description.txt by Kaggle.
In the output of data_train.info(), we see that next to each column name there are two more pieces of information: how many non-null entries that column has, and its data type. When we check all the columns, we can see that some of them contain fewer non-null entries, which tells us that there are missing values in our data set. We can also see that there are both numerical and categorical attributes. First of all, I would like to examine the missing data, because missing values can negatively affect a model’s predictions or classifications. Let’s print out the columns with missing values:
# Columns that contain at least one missing value
null_columns = data_train.columns[data_train.isnull().any()]
null_list = data_train[null_columns].isnull().sum().sort_values(ascending=False)
# Ratio of missing entries per column
percent = null_list / len(data_train)
table_view = pd.concat([null_list, percent], axis=1, keys=['Total', 'Percent'])
print(table_view.sort_values(by=['Total'], ascending=False))
Total Percent
PoolQC 1453 0.995205
MiscFeature 1406 0.963014
Alley 1369 0.937671
Fence 1179 0.807534
FireplaceQu 690 0.472603
LotFrontage 259 0.177397
GarageFinish 81 0.055479
GarageQual 81 0.055479
GarageType 81 0.055479
GarageYrBlt 81 0.055479
GarageCond 81 0.055479
BsmtFinType2 38 0.026027
BsmtExposure 38 0.026027
BsmtFinType1 37 0.025342
BsmtQual 37 0.025342
BsmtCond 37 0.025342
MasVnrArea 8 0.005479
MasVnrType 8 0.005479
Electrical 1 0.000685
Missing Data Handling
Let’s look at the definitions of these attributes in data_description.txt and handle the missing values:
PoolQC: Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
Here a missing value for PoolQC means there is no pool in the house, so it is actually meaningful. Instead of deleting it, we should put a value into those rows that tells us it is NA. But what if the house has a pool and its quality is genuinely missing? We can compare PoolQC with the PoolArea values to see whether we really have missing pool qualities.
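A one-line check along these lines produces the summary below:
print(data_train[data_train['PoolQC'].isnull()]['PoolArea'].describe())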
count 1453.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: PoolArea, dtype: float64
It seems like whenever PoolQC is null, PoolArea is zero. Therefore we can be sure that those missing PoolQC values really mean that the house doesn’t have a pool. Now that we are sure, we can put a meaningful value into those rows, such as None.
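A minimal sketch of the fill; the same pattern applies to MiscFeature, Alley, Fence and FireplaceQu below:
data_train.loc[data_train['PoolQC'].isnull(), 'PoolQC'] = 'None'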
MiscFeature: Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
If it is NA, I will assume the house doesn’t have MiscFeature.
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
If it is NA, I will assume the house has no Alley access.
Fence: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
If it is NA, I will assume the house has no Fence.
FireplaceQu: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
Just like with PoolQC, although NA means no fireplace, there might be some genuinely missing data. Let’s check:
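The same kind of check as for the pool, this time against Fireplaces:
print(data_train[data_train['FireplaceQu'].isnull()]['Fireplaces'].describe())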
count 690.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: Fireplaces, dtype: float64
For all the null FireplaceQu values, Fireplaces is zero. We are safe to assume that those houses don’t have a fireplace.
There are 259 missing values for LotFrontage, which is about 17.7 percent of the rows. The ratio is not high enough to justify deleting the column, because we might lose some valuable information.
Firstly, I would like to check if there is a high correlation between LotFrontage and other attributes.
correlation = data_train.corr()['LotFrontage']
print(correlation.sort_values(ascending=False))
LotFrontage 1.000000
1stFlrSF 0.457181
LotArea 0.426095
GrLivArea 0.402797
.
.
It seems like there is no high correlation with any other attribute, and there is no strong relationship with the class variable either. We could actually discard this column, but instead I want to fill it with the median. LotFrontage may be similar among houses of the same neighborhood. Since Neighborhood is a categorical attribute, I will use a box plot to examine it.
plt.title("LotFrontage vs Neighborhood")
plt.ylabel("Neighborhood")
plt.xlabel("LotFrontage");
plt.show()
There is no strong relationship between Neighborhood and LotFrontage, but it still seems better to take the median (I am not taking the mean because of outliers) of the same neighborhood rather than of the whole set.
# Group LotFrontage values by Neighborhood
gr = data_train['LotFrontage'].groupby(data_train['Neighborhood'])
# Take the rows with missing LotFrontage, with their respective Neighborhoods
dt_lot = data_train[data_train['LotFrontage'].isnull()]['Neighborhood']
# For every missing LotFrontage, take its Neighborhood value and retrieve the
# median of that neighborhood from the previously taken GroupBy object
for index, value in dt_lot.items():
    data_train.loc[index, 'LotFrontage'] = gr.get_group(value).median()
7) GarageFinish: Interior finish of the garage
8) GarageQual: Garage quality
9) GarageType: Garage location
10) GarageYrBlt: Year garage was built
11) GarageCond: Garage condition
Let’s handle Garage features together because they are all connected. When we check the null values for the above list, we can see that they are all null at the same rows.
But there are two more garage attributes which are never null: GarageArea and GarageCars.
GarageArea: Size of garage in square feet
GarageCars: Size of garage in car capacity
garage_columns = ['GarageType', 'GarageQual', 'GarageCond', 'GarageYrBlt',
                  'GarageFinish', 'GarageCars', 'GarageArea']
print(data_train[garage_columns][data_train['GarageYrBlt'].isnull()])
GarageType GarageQual GarageCond GarageYrBlt GarageFinish GarageCars GarageArea
39 NaN NaN NaN NaN NaN 0 0
48 NaN NaN NaN NaN NaN 0 0
78 NaN NaN NaN NaN NaN 0 0
.
.
.
For the null garage attributes, GarageArea and GarageCars are zero. We can fill the numerical null values with zero and categorical null values with None.
for column in garage_columns:
    if data_train[column].dtype == object:
        data_train.loc[data_train[column].isnull(), column] = 'None'
    else:
        data_train.loc[data_train[column].isnull(), column] = 0
12) BsmtFinType2: Rating of basement finished area (if multiple types)
13) BsmtExposure: Refers to walkout or garden level walls
14) BsmtFinType1: Rating of basement finished area
15) BsmtQual: Evaluates the height of the basement
16) BsmtCond: Evaluates the general condition of the basement
Just like the garage attributes, these attributes are null at the same rows, except for two rows: in one of them only BsmtExposure is null, and in the other only BsmtFinType2 is null. When we check the other three basement attributes (BsmtUnfSF, BsmtFinSF1 and BsmtFinSF2), which are never null, we can see that they are zero, just like in the garage example. We will use the same technique and fill the numerical attributes with zero and the categorical attributes with None, except for those two rows where only one attribute is null. For those rows we will fill the missing value with the most frequent value.
For the missing BsmtExposure:
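The counts below can be obtained with value_counts (the same call works for BsmtFinType2):
print(data_train['BsmtExposure'].value_counts())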
No 953
Av 221
Gd 134
Mn 114
I will fill the missing BsmtExposure value on line 948 with No, which is the most frequent value for BsmtExposure.
For the missing BsmtFinType2:
Unf 1256
Rec 54
LwQ 46
BLQ 33
ALQ 19
GLQ 14
I will fill the missing BsmtFinType2 value on line 332 with Unf, which is the most frequent value for BsmtFinType2.
For the rest of them we can fill the numerical null values with zero and categorical null values with None.
basement_columns = ['BsmtFinType2', 'BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'BsmtCond']
for column in basement_columns:
    if data_train[column].dtype == object:
        data_train.loc[data_train[column].isnull(), column] = 'None'
    else:
        data_train.loc[data_train[column].isnull(), column] = 0
17) MasVnrArea: Masonry veneer area in square feet
18) MasVnrType: Masonry veneer type
There are only 8 missing values for both attributes, and they are from the same houses. We can fill MasVnrArea with 0 and MasVnrType with None, because those are the most frequent values.
data_train.loc[data_train['MasVnrType'].isnull(), 'MasVnrType'] = 'None'
data_train.loc[data_train['MasVnrArea'].isnull(), 'MasVnrArea'] = 0
Electrical: Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
There is only one missing value for the electrical system. We can fill it with the most frequent value, which is SBrkr.
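A one-line sketch of that fill:
data_train.loc[data_train['Electrical'].isnull(), 'Electrical'] = 'SBrkr'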
I have also checked whether there is any duplicate row by checking the Id column, and I have seen that there is none:
data_train[data_train.duplicated(keep=False)]
Now we don’t have any missing data in our train set. We can continue exploring the data set.
Correlations and Graphs
Let’s start with checking how the SalePrice is distributed:
sns.distplot(data_train['SalePrice'], color='g')
plt.title("Distribution of Sale Price")
plt.show()
print(data_train['SalePrice'].describe())
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
Here we can see that the distribution is right-skewed and that there are some outliers. But I won’t remove them before understanding the whole data set.
We can check the distribution of each attribute:
print(data_train['Neighborhood'].describe())
count 1460
unique 25
top NAmes
freq 225
Name: Neighborhood, dtype: object
When I check all the categorical attributes, I can see that for some of them a specific category is very dominant. For example, for attributes such as Street, Utilities, Condition2, RoofMatl and Heating, the same value appears in around 99% of the rows. We could actually discard those attributes, as they won’t be very informative. For visualization, we can look at the box plot of one of them:
sns.boxplot(x='Utilities', y='SalePrice', data=data_train)
plt.show()
Here we can clearly see that the AllPub category dominates the Utilities feature, therefore there is no need to take Utilities into account.
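If we did decide to discard those near-constant attributes, a minimal sketch could look like this:
# Attributes dominated by a single category (list taken from the discussion above)
dominant_columns = ['Street', 'Utilities', 'Condition2', 'RoofMatl', 'Heating']
data_train = data_train.drop(dominant_columns, axis=1)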
We have seen the details of each attribute, but that is not enough. We should also check the correlations between attributes.
corr_mat = data_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corr_mat, vmax=.8, square=True)
plt.show()
Here we can see the correlations between the numerical attributes. Let’s print out the highly correlated pairs.
threshold = 0.5  # We set a threshold because we only want highly correlated pairs
# [corr_mat != 1.0] prevents printing the self-correlation of an attribute
high_correlations = (corr_mat[abs(corr_mat) > threshold][corr_mat != 1.0]).unstack().dropna().to_dict()
# Each pair appears twice in the correlation matrix; keep each one only once
unique_high_correlations = pd.DataFrame(
    list(set([(tuple(sorted(key)), high_correlations[key]) for key in high_correlations])),
    columns=['Attribute Pair', 'Correlation'])
# Sort by the value of the correlation
unique_high_correlations = unique_high_correlations.sort_values(by='Correlation', ascending=False)
print(unique_high_correlations)
Attribute Pair Correlation
(GarageArea, GarageCars) 0.882475
(GrLivArea, TotRmsAbvGrd) 0.825489
(1stFlrSF, TotalBsmtSF) 0.819530
(2ndFlrSF, GrLivArea) 0.687501
(BedroomAbvGr, TotRmsAbvGrd) 0.676620
(BsmtFinSF1, BsmtFullBath) 0.649212
(FullBath, GrLivArea) 0.630012
(2ndFlrSF, TotRmsAbvGrd) 0.616423
(2ndFlrSF, HalfBath) 0.609707
(GarageCars, OverallQual) 0.600671
(GarageCars, GarageYrBlt) 0.598005
(GrLivArea, OverallQual) 0.593007
(YearBuilt, YearRemodAdd) 0.592855
(OverallQual, YearBuilt) 0.572323
(1stFlrSF, GrLivArea) 0.566024
(GarageArea, OverallQual) 0.562022
(GarageArea, GarageYrBlt) 0.560783
(FullBath, TotRmsAbvGrd) 0.554784
(OverallQual, YearRemodAdd) 0.550684
(FullBath, OverallQual) 0.550600
(GarageCars, YearBuilt) 0.537850
(OverallQual, TotalBsmtSF) 0.537808
(BsmtFinSF1, TotalBsmtSF) 0.522396
(BedroomAbvGr, GrLivArea) 0.521270
(2ndFlrSF, BedroomAbvGr) 0.502901
As we can see, there are some highly correlated attribute pairs. The problem here is that this so-called multicollinearity increases the standard errors of the coefficients. Let’s also print the attributes that are highly correlated with the class attribute and compare the result with the table above.
correlation = data_train.corr()['SalePrice']
# Sort the correlations with SalePrice in descending order
print(correlation.sort_values(ascending=False))
SalePrice 1.000000
OverallQual 0.790982
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
1stFlrSF 0.605852
FullBath 0.560664
TotRmsAbvGrd 0.533723
YearBuilt 0.522897
YearRemodAdd 0.507101
MasVnrArea 0.472614
Fireplaces 0.466929
BsmtFinSF1 0.386420
LotFrontage 0.330714
WoodDeckSF 0.324413
2ndFlrSF 0.319334
OpenPorchSF 0.315856
HalfBath 0.284108
LotArea 0.263843
GarageYrBlt 0.261366
BsmtFullBath 0.227122
BsmtUnfSF 0.214479
BedroomAbvGr 0.168213
ScreenPorch 0.111447
PoolArea 0.092404
MoSold 0.046432
3SsnPorch 0.044584
BsmtFinSF2 -0.011378
BsmtHalfBath -0.016844
MiscVal -0.021190
Id -0.021917
LowQualFinSF -0.025606
YrSold -0.028923
OverallCond -0.077856
MSSubClass -0.084284
EnclosedPorch -0.128578
KitchenAbvGr -0.135907
Name: SalePrice, dtype: float64
Now I will analyze the attributes that have more than 0.5 correlation with the class attribute, together with those that have a strong correlation with other attributes.
Attributes that have a high correlation with SalePrice:
OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd
Let’s examine them:
1) OverallQual is correlated with many attributes, but I think those correlations are not strong enough to take action.
2) GrLivArea has a strong correlation with TotRmsAbvGrd. It is quite natural that the total room count depends on the total living area. But I won’t remove either of them, as I think they should be considered separately: for the same total living area, some people may prefer fewer rooms with more space, while others prefer more but smaller rooms.
3) GarageArea and GarageCars are both highly correlated with SalePrice, but they are also highly correlated with each other. When we check their meanings:
GarageArea: Size of garage in square feet
GarageCars: Size of garage in car capacity
It is obvious that the garage’s car capacity depends on the area of the garage. Instead of keeping both of them, we may remove one. In this case, GarageCars is a little more correlated with the class attribute, so we can keep it and remove GarageArea.
4) TotalBsmtSF and 1stFlrSF are highly correlated with SalePrice as well as with each other, which makes sense considering that most of the time the basement is the same size as the first floor. We can keep one of them and remove the other. In this case, TotalBsmtSF is a little more correlated with the class attribute, therefore I will ignore 1stFlrSF.
5) YearBuilt and YearRemodAdd are also correlated with each other. The description of YearRemodAdd says it is the same as YearBuilt if there was no remodeling. When I check how often they are equal, they turn out to be the same in 764 of the 1460 samples, which means a lot of houses have had remodeling or additions. Therefore I won’t delete either of these attributes.
After this quick comparison, I will create my list of important numerical features, which consists of:
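The check behind that 764 figure is a simple comparison:
# Number of houses where YearRemodAdd equals YearBuilt (no remodeling)
print((data_train['YearBuilt'] == data_train['YearRemodAdd']).sum())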
OverallQual, GrLivArea, GarageCars, TotalBsmtSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd
We can check their distributions against the class variable:
important_numerical_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF',
                                'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd', 'SalePrice']
for i in range(0, len(important_numerical_features), 3):
    sns.pairplot(data=data_train, x_vars=important_numerical_features[i:i+3],
                 y_vars=['SalePrice'])
    plt.show()
Now let’s check the distribution of the categorical attributes. To check their distributions against the class variable, we can use box plots.
# Select the categorical (object-typed) columns
categorical_features = data_train.select_dtypes(include=[object])
column_count = len(categorical_features.columns)
for i in range(column_count):
    if (i % 9) != 0:
        continue
    fig, ax = plt.subplots(3, 3, figsize=(10, 8))
    sns.boxplot(x=categorical_features.columns[i], y='SalePrice', data=data_train, ax=ax[0, 0])
    if i == (column_count - 1): continue
    sns.boxplot(x=categorical_features.columns[i + 1], y='SalePrice', data=data_train, ax=ax[0, 1])
    if i == (column_count - 2): continue
    sns.boxplot(x=categorical_features.columns[i + 2], y='SalePrice', data=data_train, ax=ax[0, 2])
    if i == (column_count - 3): continue
    sns.boxplot(x=categorical_features.columns[i + 3], y='SalePrice', data=data_train, ax=ax[1, 0])
    if i == (column_count - 4): continue
    sns.boxplot(x=categorical_features.columns[i + 4], y='SalePrice', data=data_train, ax=ax[1, 1])
    if i == (column_count - 5): continue
    sns.boxplot(x=categorical_features.columns[i + 5], y='SalePrice', data=data_train, ax=ax[1, 2])
    if i == (column_count - 6): continue
    sns.boxplot(x=categorical_features.columns[i + 6], y='SalePrice', data=data_train, ax=ax[2, 0])
    if i == (column_count - 7): continue
    sns.boxplot(x=categorical_features.columns[i + 7], y='SalePrice', data=data_train, ax=ax[2, 1])
    if i == (column_count - 8): continue
    sns.boxplot(x=categorical_features.columns[i + 8], y='SalePrice', data=data_train, ax=ax[2, 2])
    for a in ax.flatten():
        for label in a.get_xticklabels():
            label.set_rotation(90)
    plt.show()
Actually, the code to print the box plots is very simple.
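Something like this minimal sketch would do the job, one figure per attribute:
for column in categorical_features.columns:
    sns.boxplot(x=column, y='SalePrice', data=data_train)
    plt.show()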
The reason behind the complexity (redundancy) of the above code is that I wanted to plot the box plots of all attributes at once, specifically 3 x 3 attributes per frame. Plotting one graph per frame would be a hassle, as it pops up a new window for each one, and plotting every attribute on the same page would make it unreadable, so I chose to go this way.
I have also written another piece of code which does exactly the same thing.
Second way (Slower):
column_count = len(categorical_features.columns)
attr = 0
result = True
while result:
    fig, ax = plt.subplots(3, 3, figsize=(10, 8))
    for row in range(3):
        for column in range(3):
            sns.boxplot(x=categorical_features.columns[attr], y='SalePrice', data=data_train, ax=ax[row, column])
            attr = attr + 1
            if attr == column_count:
                result = False
                break
        # Leave the row loop too once every attribute has been plotted
        if not result:
            break
    for a in ax.flatten():
        for label in a.get_xticklabels():
            label.set_rotation(90)
    plt.show()
The code looks cleaner than the first one, but the first is slightly faster, so that is the one I am using. The run time for the first was around 3.2 seconds, while the second took around 3.3 seconds. The difference doesn’t seem like much, but it can grow as we get more attributes. There might be a much more efficient way to do the same job, but I am very new to the seaborn and matplotlib libraries, and I tried to come up with this solution by myself.
Here we can see that attributes such as ExterQual, KitchenQual, HeatingQC, BsmtQual and FireplaceQu are strongly related to SalePrice: when their quality is Excellent, prices tend to go up.
Although analyzing these graphs can give some insight into the effect of categorical attributes on SalePrice, creating custom features by turning those categories into numerical values would make it much easier for us to see the correlations. To encode the categories into ordered numbers, we need information about their order. For example:
HeatingQC: Heating quality and condition
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
We can easily create a new feature from HeatingQC whose values would be 1, 2, 3, 4 and 5 (1 = Po, 5 = Ex).
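A minimal sketch of that encoding; the mapping follows the order given in data_description.txt, and the names heating_map and HeatingQCNum are my own:
# Map the ordered quality categories to integers (1 = Po, 5 = Ex)
heating_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
data_train['HeatingQCNum'] = data_train['HeatingQC'].map(heating_map)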
It is actually possible to encode all categorical attributes into numerical values, even those whose values cannot naturally be ordered. We can achieve this by simply taking the mean SalePrice for each category of a given attribute and then ordering the categories by it. That may be necessary for feature engineering, but I think for this EDA (Exploratory Data Analysis) checking the box plots is enough.
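A hypothetical sketch of that idea for the Neighborhood attribute (the column name NeighborhoodRank is my own):
# Rank neighborhoods by their mean SalePrice (1 = cheapest)
means = data_train.groupby('Neighborhood')['SalePrice'].mean()
order = means.rank(method='dense').astype(int)
data_train['NeighborhoodRank'] = data_train['Neighborhood'].map(order)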
Thank you for reading this far! I hope it was useful in some way.