Exploratory Data Analysis with Python

In this article, I am going to analyze a data set from Kaggle. Kaggle is a platform for predictive modeling and analytics competitions in which companies and researchers post data and statisticians and data miners compete to produce the best models for predicting and describing the data.

This is my first ever EDA. I have tried to analyze the data as thoroughly as I can, but there may still be some missing or incorrect points.

The competition that I am going to analyze is about predicting house prices. You can find the competition by clicking here.

So, what is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the first step in your data analysis process. Here, you make sense of the data you have and then figure out what questions you want to ask and how to frame them, as well as how best to manipulate your available data sources to get the answers you need.

In this example, and in all future examples, I am going to use the Python programming language.

Challenge Description

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Let’s get to work!

When we open the link above, we can see that there are several data files. For this analysis, we only need the data_description.txt and train.csv files. The data_description.txt file contains a description of each attribute given in train.csv. We will load train.csv, and whenever we need the description of an attribute, we will look at data_description.txt.

First, let's see which libraries we are going to use in this project:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Let’s load the data set from train.csv:

data_train = pd.read_csv("train.csv")

Let’s have a quick look at our data:

data_train.info()

Example output showing the first seven columns:

RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id            1460 non-null int64
MSSubClass    1460 non-null int64
MSZoning      1460 non-null object
LotFrontage   1201 non-null float64
LotArea       1460 non-null int64
Street        1460 non-null object
Alley           91 non-null object

In the output we can see that there are a total of 1460 entries and 81 columns. The last column is our class variable, SalePrice, and the first 80 columns are the features that will shape our prediction. A full description of each column is given in data_description.txt by Kaggle.
In the output of data_train.info(), we see that next to each column name there are two pieces of information: how many non-null values that column has, and its data type. When we check all the columns, we can see that some contain fewer entries, which shows that there are missing values in our data set. We can also see that there are both numerical and categorical attributes. First of all, I would like to examine the missing data, because missing values can negatively affect a model's predictions or classifications. Let's print them out:

# Columns that contain at least one null value
null_columns = data_train.columns[data_train.isnull().any(axis=0)]
null_list = data_train[null_columns].isnull().sum().sort_values(ascending=False)
# Ratio of null values per column (1460 rows in total)
percent = (null_list / len(data_train)).sort_values(ascending=False)
table_view = pd.concat([null_list, percent], axis=1, keys=['Total', 'Percent'])
print(table_view.sort_values(by=['Total'], ascending=False))
Output:
             Total   Percent
PoolQC        1453   0.995205
MiscFeature   1406   0.963014
Alley         1369   0.937671
Fence         1179   0.807534
FireplaceQu    690   0.472603
LotFrontage    259   0.177397
GarageFinish    81   0.055479
GarageQual      81   0.055479
GarageType      81   0.055479
GarageYrBlt     81   0.055479
GarageCond      81   0.055479
BsmtFinType2    38   0.026027
BsmtExposure    38   0.026027
BsmtFinType1    37   0.025342
BsmtQual        37   0.025342
BsmtCond        37   0.025342
MasVnrArea       8   0.005479
MasVnrType       8   0.005479
Electrical       1   0.000685

 

Missing Data Handling

Let’s look at the definition of those attributes from data_description.txt and handle missing values:

Here a missing value for PoolQC means there is no pool in the house, so it is actually meaningful. Instead of deleting it, we should put a value into those cells that tells us it is NA. But what if the house has a pool and its quality is genuinely missing? We can compare PoolQC with the PoolArea values to see whether we actually have a missing pool quality.

print(data_train['PoolArea'][data_train['PoolQC'].isnull() == True].describe())
Output:
count 1453.0
mean     0.0
std      0.0
min      0.0
25%      0.0
50%      0.0
75%      0.0
max      0.0
Name: PoolArea, dtype: float64

It seems that whenever PoolQC is null, PoolArea is zero. Therefore we can be sure that those missing PoolQC values actually mean the house doesn't have a pool. Now that we are sure, we can put meaningful data into those cells, such as None.

data_train.loc[data_train['PoolQC'].isnull(), 'PoolQC'] = 'None'

If it is NA, I will assume the house doesn’t have MiscFeature.

data_train.loc[data_train['MiscFeature'].isnull(), 'MiscFeature'] = 'None'

If it is NA, I will assume the house has no Alley access.

data_train.loc[data_train['Alley'].isnull(), 'Alley'] = 'None'

If it is NA, I will assume the house has no Fence.

data_train.loc[data_train['Fence'].isnull(), 'Fence'] = 'None'

Just like with PoolQC, although NA means no fireplace, there might be some genuinely missing data. Let's check:

print(data_train['Fireplaces'][data_train['FireplaceQu'].isnull() == True].describe())
Output:
count 690.0
mean    0.0
std     0.0
min     0.0
25%     0.0
50%     0.0
75%     0.0
max     0.0
Name: Fireplaces, dtype: float64

For all the null FireplaceQu values, Fireplaces is zero. We are safe to assume that those houses don't have a fireplace.

data_train.loc[data_train['FireplaceQu'].isnull(), 'FireplaceQu'] = 'None'

There are 259 missing values for LotFrontage, which is about 17.7 percent of the rows. The ratio is not high enough to justify deleting the column, because we might lose some valuable information.
Firstly, I would like to check whether there is a high correlation between LotFrontage and the other attributes.

correlation = data_train.corr()['LotFrontage']
# Sort from the highest correlation to the lowest
print(correlation.sort_values(ascending=False))
Output:
LotFrontage 1.000000
1stFlrSF    0.457181
LotArea     0.426095
GrLivArea   0.402797
.
.

It seems that there is no high correlation with any attribute, and there is no strong bond with the class variable either. We could simply discard this column, but instead I want to fill it with the median. The LotFrontage may be similar across houses of the same neighborhood. Since Neighborhood is a categorical attribute, I will use a box plot to examine it.

sns.boxplot(x='Neighborhood', y='LotFrontage', data=data_train)
plt.title("LotFrontage vs Neighborhood")
plt.xlabel("Neighborhood")
plt.ylabel("LotFrontage")
plt.show()

There is no strong correlation between Neighborhood and LotFrontage. But it still seems better to take the median (not the mean, because of outliers) of the same neighborhood instead of the whole set.

# Group LotFrontage values by Neighborhood
gr = data_train['LotFrontage'].groupby(data_train['Neighborhood'])
# For every row with a missing LotFrontage, take its Neighborhood value and
# retrieve the median of that neighborhood from the GroupBy object above
for idx, value in data_train[data_train['LotFrontage'].isnull()]['Neighborhood'].items():
    data_train.loc[idx, 'LotFrontage'] = gr.get_group(value).median()
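As a side note, the same fill can be written more compactly with groupby and transform. This is a sketch of an equivalent alternative, not what I ran above:

# Alternative: fill each missing LotFrontage with its own neighborhood's median
data_train['LotFrontage'] = data_train.groupby('Neighborhood')['LotFrontage'] \
                                      .transform(lambda x: x.fillna(x.median()))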

Let's handle the Garage features together, because they are all connected. When we check the null values for the above list, we can see that they are all null in the same rows.
But there are two more Garage attributes which are never null: GarageArea and GarageCars.

garage_columns = ['GarageType', 'GarageQual', 'GarageCond', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea']
print(data_train[garage_columns][data_train['GarageYrBlt'].isnull() == True])
Output:
   GarageType GarageQual GarageCond GarageYrBlt GarageFinish GarageCars GarageArea
39        NaN        NaN        NaN         NaN          NaN          0          0
48        NaN        NaN        NaN         NaN          NaN          0          0
78        NaN        NaN        NaN         NaN          NaN          0          0
.
.
.

For the null garage attributes, GarageArea and GarageCars are zero. We can fill the numerical null values with zero and categorical null values with None.

for column in garage_columns:
    if data_train[column].dtype == object:
        data_train.loc[data_train[column].isnull(), column] = 'None'
    else:
        data_train.loc[data_train[column].isnull(), column] = 0

Just like the garage attributes, the basement attributes are null in the same rows (except for two rows: in one of them only BsmtExposure is null, and in the other only BsmtFinType2 is null). When we check the other three basement attributes (BsmtUnfSF, BsmtFinSF1 and BsmtFinSF2), which are never null, we can see that they are zero, just like in the example above. We will apply the same technique and fill the numerical attributes with zero and the categorical attributes with None, except for those two rows where only one attribute is null. For those rows we will fill the missing value with the most frequent value.
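To locate those two special rows, we can look for rows where one basement attribute is null while the others are present; the row indices 948 and 332 used below come from this kind of check:

# Rows where BsmtExposure alone is missing
print(data_train[data_train['BsmtExposure'].isnull() &
                 data_train['BsmtCond'].notnull()].index)
# Rows where BsmtFinType2 alone is missing
print(data_train[data_train['BsmtFinType2'].isnull() &
                 data_train['BsmtCond'].notnull()].index)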
For the missing BsmtExposure:

print(data_train['BsmtExposure'].value_counts())
Output:
No 953
Av 221
Gd 134
Mn 114

I will fill the missing BsmtExposure value in row 948 with No, which is the most frequent value for BsmtExposure:

data_train.loc[data_train.index[948], 'BsmtExposure'] = "No"

For the missing BsmtFinType2:

print(data_train['BsmtFinType2'].value_counts())
Output:
Unf 1256
Rec 54
LwQ 46
BLQ 33
ALQ 19
GLQ 14

I will fill the missing BsmtFinType2 value in row 332 with Unf, which is the most frequent value for BsmtFinType2:

data_train.loc[data_train.index[332], 'BsmtFinType2'] = "Unf"

For the rest of them we can fill the numerical null values with zero and categorical null values with None.

for column in ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
               'BsmtUnfSF', 'BsmtFinSF1', 'BsmtFinSF2']:
    if data_train[column].dtype == object:
        data_train.loc[data_train[column].isnull(), column] = 'None'
    else:
        data_train.loc[data_train[column].isnull(), column] = 0

There are only 8 missing values for each of MasVnrArea and MasVnrType, and they come from the same houses. We can fill the MasVnrArea values with 0 and the MasVnrType values with None, because those are the most frequent values.
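A quick way to confirm that both attributes really are missing in the same rows:

# True if MasVnrArea and MasVnrType are null in exactly the same rows
print((data_train['MasVnrArea'].isnull() == data_train['MasVnrType'].isnull()).all())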

data_train.loc[data_train['MasVnrArea'].isnull(), 'MasVnrArea'] = 0
data_train.loc[data_train['MasVnrType'].isnull(), 'MasVnrType'] = 'None'

There is only one missing value for the electrical system. We can fill it with the most frequent value, which is SBrkr.

data_train.loc[data_train['Electrical'].isnull(), 'Electrical'] = 'SBrkr'
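Alternatively, a small sketch that avoids hard-coding the value by taking the mode (the most frequent value) directly:

# Equivalent: fill the missing Electrical value with the most frequent category
data_train['Electrical'] = data_train['Electrical'].fillna(data_train['Electrical'].mode()[0])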

I have also checked whether there are any duplicate rows (ignoring the unique Id column), and there are none.

# Command to return duplicate rows, ignoring the always-unique Id column
data_train[data_train.drop('Id', axis=1).duplicated(keep=False)]

Now we don’t have any missing data in our train set. We can continue exploring the data set.

 

Correlations and Graphs

Let’s start with checking how the SalePrice is distributed:

print(data_train['SalePrice'].describe())
sns.distplot(data_train['SalePrice'], color='g')
plt.title("Distribution of Sale Price")
plt.show()
Output:
count   1460.000000
mean  180921.195890
std    79442.502883
min    34900.000000
25%   129975.000000
50%   163000.000000
75%   214000.000000
max   755000.000000
Name: SalePrice, dtype: float64

Here we can see that the distribution is right-skewed, and there are some outliers. But I won't remove them before understanding the whole data set.
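To back up the visual impression with a number, pandas can compute the skewness directly; a positive value confirms the right skew:

# Quantify the asymmetry of the SalePrice distribution
print("Skewness: %f" % data_train['SalePrice'].skew())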
We can check the distribution of each attribute:

for column in data_train:
    print(data_train[column].describe())
Example output from a categorical attribute:
count     1460
unique      25
top      NAmes
freq       225
Name: Neighborhood, dtype: object

When I check all the categorical attributes, I can see that for some of them a specific category is very dominant. For example:
For attributes such as Street, Utilities, Condition2, RoofMatl and Heating, the value is the same for around 99% of the rows. We can discard those attributes, as they won't be very informative. For visualization, we can look at the count plot of one of them:

sns.countplot(x='Utilities', data=data_train)
plt.show()

Here we can clearly see that the AllPub category dominates the Utilities feature, therefore there is no need to take Utilities into account.
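To make the 99% claim concrete, a small sketch that prints the share of the most frequent category for each of those candidates:

# Share of the dominant category for each low-variance candidate
for column in ['Street', 'Utilities', 'Condition2', 'RoofMatl', 'Heating']:
    top_share = data_train[column].value_counts(normalize=True).iloc[0]
    print("%s: %.1f%%" % (column, top_share * 100))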

We have seen the details of each attribute, but that is not enough. We should also check the correlations between attributes.

corr_mat = data_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corr_mat, vmax=.8, square=True)
plt.show()

Here we can see the correlation between the numerical attributes. Let's print out the highly correlated pairs.

correlations = data_train.corr()
attributes = correlations.iloc[:-1, :-1]  # all except the target, SalePrice
threshold = 0.5  # we only want to get highly correlated pairs

# [attributes != 1.0] prevents printing the self-correlation of an attribute
high_correlations = (attributes[abs(attributes) > threshold][attributes != 1.0]).unstack().dropna().to_dict()
unique_high_correlations = pd.DataFrame(
    list(set([(tuple(sorted(key)), high_correlations[key]) for key in high_correlations])),
    columns=['Attribute Pair', 'Correlation'])

# Sort by the value of the correlation
unique_high_correlations = unique_high_correlations.sort_values(by='Correlation', ascending=False)

print(unique_high_correlations)
Output:
             Attribute Pair  Correlation
    (GarageArea, GarageCars)    0.882475
   (GrLivArea, TotRmsAbvGrd)    0.825489
     (1stFlrSF, TotalBsmtSF)    0.819530
       (2ndFlrSF, GrLivArea)    0.687501
(BedroomAbvGr, TotRmsAbvGrd)    0.676620
  (BsmtFinSF1, BsmtFullBath)    0.649212
       (FullBath, GrLivArea)    0.630012
    (2ndFlrSF, TotRmsAbvGrd)    0.616423
        (2ndFlrSF, HalfBath)    0.609707
   (GarageCars, OverallQual)    0.600671
   (GarageCars, GarageYrBlt)    0.598005
    (GrLivArea, OverallQual)    0.593007
   (YearBuilt, YearRemodAdd)    0.592855
    (OverallQual, YearBuilt)    0.572323
       (1stFlrSF, GrLivArea)    0.566024
   (GarageArea, OverallQual)    0.562022
   (GarageArea, GarageYrBlt)    0.560783
    (FullBath, TotRmsAbvGrd)    0.554784
 (OverallQual, YearRemodAdd)    0.550684
     (FullBath, OverallQual)    0.550600
     (GarageCars, YearBuilt)    0.537850
  (OverallQual, TotalBsmtSF)    0.537808
   (BsmtFinSF1, TotalBsmtSF)    0.522396
   (BedroomAbvGr, GrLivArea)    0.521270
    (2ndFlrSF, BedroomAbvGr)    0.502901

As we can see here, there are some highly correlated attributes. The problem is that this so-called multicollinearity increases the standard errors of the coefficients; the sketch below quantifies it for one of the correlated groups. After that, let's also print the attributes that are highly correlated with the class attribute and compare them with the table above.
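As a hedged sketch of how one might quantify that multicollinearity, we can compute variance inflation factors. This uses statsmodels, which is not part of this article's toolkit, so treat it as an optional check; values well above 5-10 indicate strong collinearity:

# Optional sketch: variance inflation factors for one correlated group
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

features = sm.add_constant(data_train[['GarageArea', 'GarageCars', 'GrLivArea', 'TotRmsAbvGrd']])
for i, name in enumerate(features.columns):
    print(name, variance_inflation_factor(features.values, i))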

# Find the correlation of SalePrice with all numerical attributes
correlation = data_train.corr()['SalePrice']
# Sort from the highest correlation to the lowest
print(correlation.sort_values(ascending=False))
Output:
SalePrice      1.000000
OverallQual    0.790982
GrLivArea      0.708624
GarageCars     0.640409
GarageArea     0.623431
TotalBsmtSF    0.613581
1stFlrSF       0.605852
FullBath       0.560664
TotRmsAbvGrd   0.533723
YearBuilt      0.522897
YearRemodAdd   0.507101
MasVnrArea     0.472614
Fireplaces     0.466929
BsmtFinSF1     0.386420
LotFrontage    0.330714
WoodDeckSF     0.324413
2ndFlrSF       0.319334
OpenPorchSF    0.315856
HalfBath       0.284108
LotArea        0.263843
GarageYrBlt    0.261366
BsmtFullBath   0.227122
BsmtUnfSF      0.214479
BedroomAbvGr   0.168213
ScreenPorch    0.111447
PoolArea       0.092404
MoSold         0.046432
3SsnPorch      0.044584
BsmtFinSF2    -0.011378
BsmtHalfBath  -0.016844
MiscVal       -0.021190
Id            -0.021917
LowQualFinSF  -0.025606
YrSold        -0.028923
OverallCond   -0.077856
MSSubClass    -0.084284
EnclosedPorch -0.128578
KitchenAbvGr  -0.135907
Name: SalePrice, dtype: float64

Now I will analyze the attributes that have more than 0.5 correlation with the class attribute, as well as those that have a strong correlation with other attributes.
Attributes that have a high correlation with SalePrice:
OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd

Let's examine them:

1) OverallQual is correlated with lots of attributes, but I don't think those correlations are strong enough to take action on.

2) GrLivArea has a strong correlation with TotRmsAbvGrd. It is natural that the total room count depends on the total living area. But I won't remove either of them, as I think they should be considered separately: for the same total living area, some people may prefer fewer, larger rooms, while others prefer more but smaller rooms.

3) GarageArea and GarageCars are both highly correlated with SalePrice, but they are also highly correlated with each other. When we check their meanings, it is obvious that the garage car capacity depends on the area of the garage. Instead of keeping both, we may remove one of them. In this case, GarageCars is a little more correlated with the class attribute, so we can consider keeping it and removing GarageArea.

4) TotalBsmtSF and 1stFlrSF are highly correlated with SalePrice as well as with each other, considering that most of the time the basement is the same size as the first floor. We can keep one of them and remove the other. In this case, TotalBsmtSF is a little more correlated with the class attribute, therefore I will ignore 1stFlrSF.

5) YearBuilt and YearRemodAdd are also correlated with each other. The description of YearRemodAdd says it is the same as YearBuilt if there was no remodeling. When I check how often they are equal, they are the same in 764 of the 1460 samples, which means a lot of houses had remodeling or additions. Therefore I won't delete either of these attributes.
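The check itself is a one-liner:

# Count the houses where YearRemodAdd equals YearBuilt (no remodeling)
print((data_train['YearBuilt'] == data_train['YearRemodAdd']).sum())  # 764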

After this quick comparison, I will create my list of important numerical features, which consists of:
OverallQual, GrLivArea, GarageCars, TotalBsmtSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd

We can check their distribution against the class variable:

important_numerical_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF',
                                'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd', 'SalePrice']
for i in range(0, len(important_numerical_features), 3):
    sns.pairplot(data=data_train, x_vars=important_numerical_features[i:i+3],
                 y_vars=['SalePrice'])
    plt.show()

Now let's check the distribution of the categorical attributes. To check their distribution against the class variable, we can use box plots.

categorical_features = data_train.select_dtypes(include=['object'])
column_count = len(categorical_features.columns)

for i in range(column_count):
    if (i % 9) != 0:
        continue
    fig, ax = plt.subplots(3, 3, figsize=(10, 8))
    sns.boxplot(x=categorical_features.columns[i], y='SalePrice', data=data_train, ax=ax[0, 0])
    if i == (column_count - 1): continue
    sns.boxplot(x=categorical_features.columns[i + 1], y='SalePrice', data=data_train, ax=ax[0, 1])
    if i == (column_count - 2): continue
    sns.boxplot(x=categorical_features.columns[i + 2], y='SalePrice', data=data_train, ax=ax[0, 2])
    if i == (column_count - 3): continue
    sns.boxplot(x=categorical_features.columns[i + 3], y='SalePrice', data=data_train, ax=ax[1, 0])
    if i == (column_count - 4): continue
    sns.boxplot(x=categorical_features.columns[i + 4], y='SalePrice', data=data_train, ax=ax[1, 1])
    if i == (column_count - 5): continue
    sns.boxplot(x=categorical_features.columns[i + 5], y='SalePrice', data=data_train, ax=ax[1, 2])
    if i == (column_count - 6): continue
    sns.boxplot(x=categorical_features.columns[i + 6], y='SalePrice', data=data_train, ax=ax[2, 0])
    if i == (column_count - 7): continue
    sns.boxplot(x=categorical_features.columns[i + 7], y='SalePrice', data=data_train, ax=ax[2, 1])
    if i == (column_count - 8): continue
    sns.boxplot(x=categorical_features.columns[i + 8], y='SalePrice', data=data_train, ax=ax[2, 2])
    for a in ax.flatten():
        for label in a.get_xticklabels():
            label.set_rotation(90)

plt.show()

Actually, the code to print the box plots is very simple:

sns.boxplot(x=attribute1, y=attribute2, data=data_set)

The reason for the complexity (redundancy) of the code above is that I wanted to plot the box graph of every attribute at once, specifically 3 × 3 attributes per frame. Plotting one graph per frame would be troublesome, as it pops up a new window for each frame, and plotting every attribute on the same frame would make it unreadable, therefore I chose to go this way.
I have also written another piece of code which does exactly the same thing.
Second way (slower):

categorical_features = data_train.select_dtypes(include=['object'])
column_count = len(categorical_features.columns)

attr = 0
result = True
while result:
    fig, ax = plt.subplots(3, 3, figsize=(10, 8))

    for row in range(3):
        for column in range(3):
            if attr == column_count:
                result = False
                break
            sns.boxplot(x=categorical_features.columns[attr], y='SalePrice', data=data_train, ax=ax[row, column])
            attr = attr + 1
        if not result:
            break

    for a in ax.flatten():
        for label in a.get_xticklabels():
            label.set_rotation(90)

plt.show()

The second version looks cleaner than the first one, but the first is slightly faster, therefore I am using it. The calculation time for the first one was around 3.2 seconds, while for the second one it was around 3.3 seconds. The difference doesn't seem like much, but it can grow if we get more attributes. There might be a much more efficient way to do the same job, but I am very new to the seaborn and matplotlib libraries, and I tried to come up with this solution by myself.
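For reference, a minimal way to reproduce such a timing (a sketch; wrap whichever plotting loop you want to measure):

import time

start = time.time()
# ... run one of the two plotting loops here ...
print("Elapsed: %.2f seconds" % (time.time() - start))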

Here we can see that attributes such as ExterQual, KitchenQual, HeatingQC, BsmtQual and FireplaceQu are highly correlated with SalePrice: when their quality is Excellent, prices tend to go up.
Although analyzing these graphs can give some insight into the effect of the categorical attributes on SalePrice, creating custom features by turning those categories into numerical values would make it much easier for us to see the correlation. To encode the categories into ordered numbers, we need information about their order. For example:

We can easily create a new feature from HeatingQC whose values would be 1, 2, 3, 4 and 5 (1 = Po, 5 = Ex).
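As a minimal sketch of that mapping (the quality scale comes from data_description.txt; HeatingQC_num is a hypothetical new column name):

# Encode the ordered quality scale of HeatingQC as numbers (1 = Po ... 5 = Ex)
quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
data_train['HeatingQC_num'] = data_train['HeatingQC'].map(quality_map)
# A quick way to see the correlation with the class variable
print(data_train[['HeatingQC_num', 'SalePrice']].corr())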
It is actually possible to encode all categorical attributes into numerical values, even if their values cannot be ordered. We can achieve this by simply taking the mean SalePrice for each category of a given attribute and then ordering the categories by it. That may be necessary for feature engineering, but I think for this EDA checking the box plots is enough.
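A hedged sketch of that idea, using Neighborhood as the example (Neighborhood_num is a hypothetical column name):

# Order the Neighborhood categories by their mean SalePrice and encode the rank
order = data_train.groupby('Neighborhood')['SalePrice'].mean().sort_values()
rank_map = {name: rank for rank, name in enumerate(order.index, start=1)}
data_train['Neighborhood_num'] = data_train['Neighborhood'].map(rank_map)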

Thank you for reading this far! I hope it was useful in some way.

If you liked the content please share it!
