# Exploratory Data Analysis with Python

In this article, I am going to analyze a data set from Kaggle. **Kaggle** is a platform for predictive modeling and analytics competitions in which companies and researchers post data, and statisticians and data miners compete to produce the best models for predicting and describing the data.

This is my first ever EDA. I have tried to analyze the data as thoroughly as I can, but there might still be some missing or incorrect points.

The competition that I am going to analyze is about predicting house prices. You can find the competition by clicking here.

#### So, what is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the first step in your data analysis process. Here, you make sense of the data you have and then figure out what questions you want to ask and how to frame them, as well as how best to manipulate your available data sources to get the answers you need.

In this example and in all future examples, I am going to use the Python programming language.

#### Challenge Description

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

#### Let’s get to work!

When we open the link above, we can see that there are some data files. For this analysis, we only need the data_description.txt and train.csv files. The data_description.txt file contains information on each attribute given in train.csv. We will load train.csv, and whenever we need a description of an attribute, we will look it up in data_description.txt.

First, let's see which libraries we are going to use in this project (pandas is added here because it is used as `pd` throughout):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

Let’s load the data set from train.csv:

Let’s have a quick look at our data:
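Loading and inspecting the data can be sketched as follows. Since train.csv isn't bundled with this article, a tiny inline CSV stands in for the file so the sketch is runnable; in the real analysis the call is simply `pd.read_csv('train.csv')`:

```python
import io
import pandas as pd

# In the real analysis this would be: data_train = pd.read_csv('train.csv')
# Here a tiny CSV string stands in for the file.
csv_text = "Id,MSSubClass,LotFrontage,SalePrice\n1,60,65.0,208500\n2,20,,181500\n"
data_train = pd.read_csv(io.StringIO(csv_text))

# info() lists each column with its non-null count and dtype
data_train.info()
```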

Example output including the first 7 columns:

```
Data columns (total 81 columns):
Id             1460 non-null int64
MSSubClass     1460 non-null int64
MSZoning       1460 non-null object
LotFrontage    1201 non-null float64
LotArea        1460 non-null int64
Street         1460 non-null object
Alley          91 non-null object
```

In the output we can see that there are a total of 1460 entries and 81 columns. The last column is our class variable, **SalePrice**, and the first 80 columns are the features that will shape our prediction. A full description of each column is given in data_description.txt by **Kaggle**.

In the output of **data_train.info()**, we see that next to each column name there are two more pieces of information: the non-null count and the dtype. The non-null count shows how many entries of that column are not null, and when we check all the columns, we can see that some contain fewer entries. This tells us there are missing values in our data set. We can also see that there are both numerical and categorical attributes. First of all, I would like to examine the missing data, because missing values can negatively affect the prediction or classification of a model. Let's print them out:

```python
# Columns that contain at least one missing value
null_columns = data_train.columns[data_train.isnull().any()]
null_list = data_train[null_columns].isnull().sum().sort_values(ascending=False)
percent = (null_list / data_train.isnull().count()).sort_values(ascending=False)
table_view = pd.concat([null_list, percent[0:null_list.count()]], axis=1,
                       keys=['Total', 'Percent'])
print(table_view.sort_values(by=['Total'], ascending=False))
```

```
              Total   Percent
PoolQC         1453  0.995205
MiscFeature    1406  0.963014
Alley          1369  0.937671
Fence          1179  0.807534
FireplaceQu     690  0.472603
LotFrontage     259  0.177397
GarageFinish     81  0.055479
GarageQual       81  0.055479
GarageType       81  0.055479
GarageYrBlt      81  0.055479
GarageCond       81  0.055479
BsmtFinType2     38  0.026027
BsmtExposure     38  0.026027
BsmtFinType1     37  0.025342
BsmtQual         37  0.025342
BsmtCond         37  0.025342
MasVnrArea        8  0.005479
MasVnrType        8  0.005479
Electrical        1  0.000685
```

## Missing Data Handling

Let’s look at the definition of those attributes from data_description.txt and handle missing values:

```
PoolQC: Pool quality

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   NA   No Pool
```

Here, a missing value for PoolQC means the house has no pool, so it is actually meaningful. Instead of deleting it, we should put a value into those cells that tells us it is NA. But what if a house has a pool and its quality is genuinely missing? We can compare PoolQC with PoolArea to see whether we actually have missing pool quality.
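A minimal sketch of that check, on a toy frame standing in for train.csv; on the real data the call would be `data_train[data_train['PoolQC'].isnull()]['PoolArea'].describe()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv: two houses with pools, two with missing PoolQC.
data_train = pd.DataFrame({
    'PoolQC':   ['Ex', np.nan, 'Gd', np.nan],
    'PoolArea': [512,  0,      648,  0],
})

# Describe PoolArea only for the rows where PoolQC is missing
stats = data_train[data_train['PoolQC'].isnull()]['PoolArea'].describe()
print(stats)
```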

```
count    1453.0
mean        0.0
std         0.0
min         0.0
25%         0.0
50%         0.0
75%         0.0
max         0.0
Name: PoolArea, dtype: float64
```

It seems that whenever PoolQC is null, PoolArea is zero. Therefore we can be sure that those missing PoolQC values actually mean the house doesn't have a pool. Now that we are sure, we can put meaningful data into those cells, such as None.
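The fill itself is a one-liner; sketched here on a toy column standing in for the real PoolQC:

```python
import numpy as np
import pandas as pd

# Toy stand-in: two houses without a pool have PoolQC missing.
data_train = pd.DataFrame({'PoolQC': ['Ex', np.nan, np.nan]})

# Replace the missing values with the string 'None' to mark "no pool"
data_train.loc[data_train['PoolQC'].isnull(), 'PoolQC'] = 'None'
print(data_train['PoolQC'].tolist())  # ['Ex', 'None', 'None']
```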

```
MiscFeature: Miscellaneous feature not covered in other categories

   Elev  Elevator
   Gar2  2nd Garage (if not described in garage section)
   Othr  Other
   Shed  Shed (over 100 SF)
   TenC  Tennis Court
   NA    None
```

If it is NA, I will assume the house doesn’t have MiscFeature.

```
Alley: Type of alley access to property

   Grvl  Gravel
   Pave  Paved
   NA    No alley access
```

If it is NA, I will assume the house has no Alley access.

```
Fence: Fence quality

   GdPrv  Good Privacy
   MnPrv  Minimum Privacy
   GdWo   Good Wood
   MnWw   Minimum Wood/Wire
   NA     No Fence
```

If it is NA, I will assume the house has no Fence.
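All three of these "NA means absent" columns can be filled the same way as PoolQC; a minimal sketch on a toy frame standing in for train.csv:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three columns whose NA means "feature absent"
data_train = pd.DataFrame({
    'MiscFeature': ['Shed', np.nan],
    'Alley':       [np.nan, 'Grvl'],
    'Fence':       [np.nan, np.nan],
})

# Mark every missing value as 'None' (feature not present)
for column in ['MiscFeature', 'Alley', 'Fence']:
    data_train.loc[data_train[column].isnull(), column] = 'None'
```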

```
FireplaceQu: Fireplace quality

   Ex   Excellent - Exceptional Masonry Fireplace
   Gd   Good - Masonry Fireplace in main level
   TA   Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
   Fa   Fair - Prefabricated Fireplace in basement
   Po   Poor - Ben Franklin Stove
   NA   No Fireplace
```

Just like with PoolQC, although NA means no fireplace, there might be some genuinely missing data. Let's check:
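The same check as for the pool, sketched on a toy frame standing in for train.csv; the real call would be `data_train[data_train['FireplaceQu'].isnull()]['Fireplaces'].describe()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in: houses with missing FireplaceQu all have zero fireplaces.
data_train = pd.DataFrame({
    'FireplaceQu': ['Gd', np.nan, 'TA', np.nan],
    'Fireplaces':  [1,    0,      2,    0],
})

# Describe the fireplace count only where the quality is missing
stats = data_train[data_train['FireplaceQu'].isnull()]['Fireplaces'].describe()
print(stats)
```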

```
count    690.0
mean       0.0
std        0.0
min        0.0
25%        0.0
50%        0.0
75%        0.0
max        0.0
Name: Fireplaces, dtype: float64
```

For all the null FireplaceQu values, Fireplaces is zero. We are safe to assume that those houses don't have a fireplace.

There are 259 missing values for LotFrontage, which is about 17.7 percent of the rows. The ratio is not high enough to justify deleting the column, because we might lose some valuable information.

Firstly, I would like to check if there is a high correlation between LotFrontage and other attributes.

```python
correlation = data_train.corr()['LotFrontage']
print(correlation[np.argsort(correlation, axis=0)[::-1]])
```

```
LotFrontage    1.000000
1stFlrSF       0.457181
LotArea        0.426095
GrLivArea      0.402797
...
```

It seems there is no high correlation with any attribute, and no strong bond with the class variable either. We could actually discard this column, but instead I want to fill it with the median. LotFrontage may be similar among houses of the same neighborhood. Since Neighborhood is a categorical attribute, I will use a box plot to examine it.

```python
sns.boxplot(x='LotFrontage', y='Neighborhood', data=data_train)
plt.title("LotFrontage vs Neighborhood")
plt.ylabel("Neighborhood")
plt.xlabel("LotFrontage")
plt.show()
```

There is no strong correlation between Neighborhood and LotFrontage. But it still seems better to take the median (I am not taking the mean because of outliers) of the same neighborhood rather than of the whole set.

```python
gr = data_train['LotFrontage'].groupby(data_train['Neighborhood'])
# Take the missing LotFrontage rows with their respective Neighborhoods
dt_lot = data_train[data_train['LotFrontage'].isnull()]['Neighborhood']
# For every missing LotFrontage, take its Neighborhood value and retrieve the
# median of that neighborhood from the previously taken GroupBy object
for index, value in dt_lot.items():
    data_train.loc[index, 'LotFrontage'] = gr.get_group(value).median()
```

8) GarageQual: Garage quality

9) GarageType: Garage location

10) GarageYrBlt: Year garage was built

11) GarageCond: Garage condition

Let’s handle Garage features together because they are all connected. When we check the null values for the above list, we can see that they are all null at the same rows.

But there are 2 more Garage attributes which are not null. GarageArea and GarageCars.

```
GarageArea: Size of garage in square feet
GarageCars: Size of garage in car capacity
```

```python
garage_columns = ['GarageType', 'GarageQual', 'GarageCond', 'GarageYrBlt',
                  'GarageFinish', 'GarageCars', 'GarageArea']
print(data_train[garage_columns][data_train['GarageYrBlt'].isnull()])
```

```
   GarageType GarageQual GarageCond  GarageYrBlt GarageFinish  GarageCars  GarageArea
39        NaN        NaN        NaN          NaN          NaN           0           0
48        NaN        NaN        NaN          NaN          NaN           0           0
78        NaN        NaN        NaN          NaN          NaN           0           0
...
```

For the null garage attributes, GarageArea and GarageCars are zero. We can fill the numerical null values with zero and categorical null values with None.

```python
for column in garage_columns:
    if data_train[column].dtype == object:
        data_train.loc[data_train[column].isnull(), column] = 'None'
    else:
        data_train.loc[data_train[column].isnull(), column] = 0
```

13) BsmtExposure: Refers to walkout or garden level walls

14) BsmtFinType1: Rating of basement finished area

15) BsmtQual: Evaluates the height of the basement

16) BsmtCond: Evaluates the general condition of the basement

Just like the garage attributes, these attributes are null at the same rows (except for two rows: in one of them only BsmtExposure is null, and in the other only BsmtFinType2 is null). When we check the other three basement attributes (BsmtUnfSF, BsmtFinSF1 and BsmtFinSF2), which are not null, we can see that they are zero, just like in the example above. We will apply the same technique and fill the numerical attributes with zero and the categorical attributes with None, except for those two rows where only one attribute is null. For those rows we will fill the missing value with the most frequent value.

For the missing BsmtExposure:

```
No    953
Av    221
Gd    134
Mn    114
```

I will fill the missing BsmtExposure value on row 948 with No, which is the most frequent value for BsmtExposure.

For the missing BsmtFinType2:

```
Unf    1256
Rec      54
LwQ      46
BLQ      33
ALQ      19
GLQ      14
```

I will fill the missing BsmtFinType2 value on row 332 with Unf, which is the most frequent value for BsmtFinType2.
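Both of these single-value fills can be done with `mode()`, which returns the most frequent category. A minimal sketch on a toy frame standing in for train.csv:

```python
import numpy as np
import pandas as pd

# Toy stand-in: one missing value per column among dominant categories.
data_train = pd.DataFrame({
    'BsmtExposure': ['No', 'No', 'Av', np.nan, 'No'],
    'BsmtFinType2': ['Unf', 'Unf', np.nan, 'Rec', 'Unf'],
})

# Fill each missing value with the most frequent category of its column
for column in ['BsmtExposure', 'BsmtFinType2']:
    data_train.loc[data_train[column].isnull(), column] = data_train[column].mode()[0]
```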

For the rest of them we can fill the numerical null values with zero and categorical null values with None.

```python
# The remaining basement columns that still contain nulls
basement_columns = ['BsmtFinType2', 'BsmtExposure', 'BsmtFinType1',
                    'BsmtQual', 'BsmtCond']
for column in basement_columns:
    if data_train[column].dtype == object:
        data_train.loc[data_train[column].isnull(), column] = 'None'
    else:
        data_train.loc[data_train[column].isnull(), column] = 0
```

18) MasVnrType: Masonry veneer type

There are only 8 missing values for both MasVnrArea and MasVnrType, and they come from the same houses. We can fill MasVnrArea with 0 and MasVnrType with None, because those are the most frequent values.

```python
data_train.loc[data_train['MasVnrArea'].isnull(), 'MasVnrArea'] = 0
data_train.loc[data_train['MasVnrType'].isnull(), 'MasVnrType'] = 'None'
```

```
Electrical: Electrical system

   SBrkr  Standard Circuit Breakers & Romex
   FuseA  Fuse Box over 60 AMP and all Romex wiring (Average)
   FuseF  60 AMP Fuse Box and mostly Romex wiring (Fair)
   FuseP  60 AMP Fuse Box and mostly knob & tube wiring (poor)
   Mix    Mixed
```

There is only 1 missing data for electrical system. We can fill it with the most frequent value which is SBrkr.
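This fill can again be sketched with `mode()` on a toy column standing in for Electrical:

```python
import numpy as np
import pandas as pd

# Toy stand-in: one missing Electrical value among mostly 'SBrkr'.
data_train = pd.DataFrame({'Electrical': ['SBrkr', 'SBrkr', 'FuseA', np.nan]})

# Fill the single missing value with the most frequent category
most_frequent = data_train['Electrical'].mode()[0]
data_train.loc[data_train['Electrical'].isnull(), 'Electrical'] = most_frequent
```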

I have also checked whether there are any duplicate rows by looking at the Id column, and there are none.

```python
data_train[data_train.duplicated(keep=False)]
```

#### Now we don’t have any missing data in our train set. We can continue exploring the data set.

## Correlations and Graphs

Let’s start with checking how the SalePrice is distributed:

```python
sns.distplot(data_train['SalePrice'], color='g')
plt.title("Distribution of Sale Price")
plt.show()
```

```
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64
```

Here we can see that the distribution is right-skewed and that there are some outliers. But I won't remove them before understanding the whole data set.

We can check the distribution of each attribute:

```python
for column in data_train.columns:
    print(data_train[column].describe())
```

Example output for one column:

```
count     1460
unique      25
top      NAmes
freq       225
Name: Neighborhood, dtype: object
```

When I check all categorical attributes, I can see that for some of them a specific category is very dominant. For example:

For attributes such as Street, Utilities, Condition2, RoofMatl and Heating, the value is the same for around 99% of the rows. We can actually discard those attributes, as they won't be very effective. For visualization, we can look at the box plot of one of them:

```python
sns.boxplot(x='Utilities', y='SalePrice', data=data_train)
plt.show()
```

Here we can clearly see that the AllPub category dominates the Utilities feature; therefore there is no need to take Utilities into account.
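The dominance can also be checked numerically with `value_counts()`. A minimal sketch on a toy column standing in for Utilities:

```python
import pandas as pd

# Toy stand-in for the Utilities column: one category dominates.
utilities = pd.Series(['AllPub'] * 99 + ['NoSeWa'])

counts = utilities.value_counts()
print(counts)
print(counts / len(utilities))  # AllPub covers 99% of the rows
```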

We have seen the details of each attribute, but that is not enough. We should also check the correlations between attributes.

```python
corr_mat = data_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corr_mat, vmax=.8, square=True)
plt.show()
```

Here we can see correlation between numerical attributes. Let’s print out highly correlated attributes.

```python
threshold = 0.5  # We only want to get highly correlated pairs
attributes = data_train.corr()
# [attributes != 1.0] prevents printing the self-correlation of an attribute
high_correlations = (attributes[abs(attributes) > threshold]
                     [attributes != 1.0]).unstack().dropna().to_dict()
unique_high_correlations = pd.DataFrame(
    list(set([(tuple(sorted(key)), high_correlations[key])
              for key in high_correlations])),
    columns=['Attribute Pair', 'Correlation'])
# Sort by value of correlation
unique_high_correlations = unique_high_correlations.sort_values(
    by='Correlation', ascending=False)
print(unique_high_correlations)
```

```
                Attribute Pair  Correlation
      (GarageArea, GarageCars)     0.882475
     (GrLivArea, TotRmsAbvGrd)     0.825489
       (1stFlrSF, TotalBsmtSF)     0.819530
         (2ndFlrSF, GrLivArea)     0.687501
  (BedroomAbvGr, TotRmsAbvGrd)     0.676620
      (BsmtFinSF1, BsmtFullBath)   0.649212
          (FullBath, GrLivArea)    0.630012
      (2ndFlrSF, TotRmsAbvGrd)     0.616423
          (2ndFlrSF, HalfBath)     0.609707
     (GarageCars, OverallQual)     0.600671
     (GarageCars, GarageYrBlt)     0.598005
      (GrLivArea, OverallQual)     0.593007
    (YearBuilt, YearRemodAdd)      0.592855
      (OverallQual, YearBuilt)     0.572323
         (1stFlrSF, GrLivArea)     0.566024
     (GarageArea, OverallQual)     0.562022
     (GarageArea, GarageYrBlt)     0.560783
      (FullBath, TotRmsAbvGrd)     0.554784
   (OverallQual, YearRemodAdd)     0.550684
       (FullBath, OverallQual)     0.550600
       (GarageCars, YearBuilt)     0.537850
    (OverallQual, TotalBsmtSF)     0.537808
     (BsmtFinSF1, TotalBsmtSF)     0.522396
     (BedroomAbvGr, GrLivArea)     0.521270
      (2ndFlrSF, BedroomAbvGr)     0.502901
```

As we can see, there are some highly correlated attributes. The problem is that this so-called multicollinearity increases the standard errors of the coefficients. Let's also print the attributes that are highly correlated with the class attribute and compare them with the table above.

```python
correlation = data_train.corr()['SalePrice']
print(correlation[np.argsort(correlation, axis=0)[::-1]])
```

```
SalePrice        1.000000
OverallQual      0.790982
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
MasVnrArea       0.472614
Fireplaces       0.466929
BsmtFinSF1       0.386420
LotFrontage      0.330714
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
GarageYrBlt      0.261366
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
ScreenPorch      0.111447
PoolArea         0.092404
MoSold           0.046432
3SsnPorch        0.044584
BsmtFinSF2      -0.011378
BsmtHalfBath    -0.016844
MiscVal         -0.021190
Id              -0.021917
LowQualFinSF    -0.025606
YrSold          -0.028923
OverallCond     -0.077856
MSSubClass      -0.084284
EnclosedPorch   -0.128578
KitchenAbvGr    -0.135907
Name: SalePrice, dtype: float64
```

Now I will analyze attributes that have more than 0.5 correlation with class attribute and those that have strong correlation with other attributes.

Attributes that have a high correlation with SalePrice:

OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd

Let’s examine them;

**1)** OverallQual correlates with many attributes, but I think none of those correlations is strong enough to require action.

**2)** GrLivArea has a strong correlation with TotRmsAbvGrd. It is quite natural that the total room count depends on the total living area, but I won't remove either of them, as I think they should be considered separately: for the same total living area, some people may prefer fewer but larger rooms, while others prefer more but smaller rooms.

**3)** GarageArea and GarageCars are both highly correlated with SalePrice, but also highly correlated with each other. When we check their meanings:

```
GarageArea: Size of garage in square feet
GarageCars: Size of garage in car capacity
```

It is obvious that garage car capacity depends on the area of the garage. Instead of keeping both, we may remove one of them. In this case, GarageCars is a little more correlated with the class attribute, so we can keep it and remove GarageArea.

**4)** TotalBsmtSF and 1stFlrSF are highly correlated with SalePrice as well as with each other, considering that most of the time the basement is the same size as the first floor. We can keep one of them and remove the other. In this case, TotalBsmtSF is a little more correlated with the class attribute, therefore I will ignore 1stFlrSF.

**5)** YearBuilt and YearRemodAdd are also correlated with each other. The description of YearRemodAdd says it is the same as YearBuilt if there was no remodeling. When I check how often they are equal, they are the same in 764 of the 1460 samples, which means many houses had remodeling or additions. Therefore I won't delete either attribute.
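That check is a simple element-wise comparison; a minimal sketch on a toy frame standing in for train.csv:

```python
import pandas as pd

# Toy stand-in: three of five houses were never remodeled (years match).
data_train = pd.DataFrame({
    'YearBuilt':    [1990, 1975, 2003, 1960, 2003],
    'YearRemodAdd': [1990, 1998, 2003, 1980, 2003],
})

# Count the rows where the two years are identical
same = (data_train['YearBuilt'] == data_train['YearRemodAdd']).sum()
print(same)  # 3
```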

After this quick comparison, I will create my list of important numerical features, which consists of:

OverallQual, GrLivArea, GarageCars, TotalBsmtSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd

We can check their distribution with class variable:

```python
important_numerical_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF',
                                'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd',
                                'SalePrice']
for i in range(0, len(important_numerical_features), 3):
    sns.pairplot(data=data_train, x_vars=important_numerical_features[i:i+3],
                 y_vars=['SalePrice'])
    plt.show()
```

Now let’s check distribution of categorical attributes. To check their distribution with class variable we can use box plots.

```python
categorical_features = data_train.select_dtypes(include=['object'])
column_count = len(categorical_features.columns)
for i in range(column_count):
    if (i % 9) != 0:
        continue
    fig, ax = plt.subplots(3, 3, figsize=(10, 8))
    sns.boxplot(categorical_features.columns[i], 'SalePrice', data=data_train, ax=ax[0, 0])
    if i == (column_count - 1): continue
    sns.boxplot(categorical_features.columns[i + 1], 'SalePrice', data=data_train, ax=ax[0, 1])
    if i == (column_count - 2): continue
    sns.boxplot(categorical_features.columns[i + 2], 'SalePrice', data=data_train, ax=ax[0, 2])
    if i == (column_count - 3): continue
    sns.boxplot(categorical_features.columns[i + 3], 'SalePrice', data=data_train, ax=ax[1, 0])
    if i == (column_count - 4): continue
    sns.boxplot(categorical_features.columns[i + 4], 'SalePrice', data=data_train, ax=ax[1, 1])
    if i == (column_count - 5): continue
    sns.boxplot(categorical_features.columns[i + 5], 'SalePrice', data=data_train, ax=ax[1, 2])
    if i == (column_count - 6): continue
    sns.boxplot(categorical_features.columns[i + 6], 'SalePrice', data=data_train, ax=ax[2, 0])
    if i == (column_count - 7): continue
    sns.boxplot(categorical_features.columns[i + 7], 'SalePrice', data=data_train, ax=ax[2, 1])
    if i == (column_count - 8): continue
    sns.boxplot(categorical_features.columns[i + 8], 'SalePrice', data=data_train, ax=ax[2, 2])
    for a in ax.flatten():
        for label in a.get_xticklabels():
            label.set_rotation(90)
    plt.show()
```

Actually, the code to print the box plots is very simple. The reason behind the complexity (redundancy) of the code above is that I wanted to plot the box plot of every attribute at once, specifically 3 x 3 attributes per frame. Plotting one graph per page would be troublesome, as a new page pops up for each frame, and plotting every attribute on the same page would make it unreadable; therefore I chose to go this way.

I have also written another piece of code which does exactly the same thing.

Second way (Slower):

```python
column_count = len(categorical_features.columns)
attr = 0
result = True
while result:
    fig, ax = plt.subplots(3, 3, figsize=(10, 8))
    for row in range(3):
        for column in range(3):
            sns.boxplot(categorical_features.columns[attr], 'SalePrice',
                        data=data_train, ax=ax[row, column])
            attr = attr + 1
            if attr == column_count:
                result = False
                break
        if not result:
            break
    for a in ax.flatten():
        for label in a.get_xticklabels():
            label.set_rotation(90)
    plt.show()
```

This code looks cleaner than the first, but the first is slightly faster, therefore I am using it. The first version took around 3.2 seconds, while the second took around 3.3 seconds. The difference doesn't seem like much, but it can grow as we get more attributes. There might be a much more efficient way to do the same job, but I am very new to the seaborn and matplotlib libraries and tried to come up with this solution by myself.

Here we can see that attributes such as ExterQual, KitchenQual, HeatingQC, BsmtQual and FireplaceQu are highly correlated with SalePrice: when their quality is Excellent, prices tend to go up.

Although analyzing these graphs can give some insight into the effect of categorical attributes on SalePrice, creating custom features by turning those categories into numerical values would make it much easier for us to see the correlation. To encode the categories as ordered numbers, we need information about their order. For example:

```
HeatingQC: Heating quality and condition

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor
```

We can easily create a new feature from HeatingQC whose values are 1, 2, 3, 4 and 5 (1 = Po, 5 = Ex).
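A minimal sketch of this ordinal encoding with `Series.map`, on a toy column standing in for HeatingQC (the new column name `HeatingQC_ord` is just an illustrative choice):

```python
import pandas as pd

# Map the ordered quality categories to numbers (1 = Po ... 5 = Ex)
quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# Toy stand-in for the HeatingQC column
data_train = pd.DataFrame({'HeatingQC': ['Ex', 'TA', 'Gd', 'Po']})
data_train['HeatingQC_ord'] = data_train['HeatingQC'].map(quality_map)
print(data_train)
```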

Actually, it is possible to encode all categorical attributes into numerical values even if their values cannot be ordered. We can achieve this by taking the mean SalePrice for each category of a given attribute and then ordering the categories by it. This may be necessary for feature engineering, but I think for this **EDA (Exploratory Data Analysis)** checking box plots is enough.
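The mean-price ordering just described can be sketched as follows, on a toy frame standing in for train.csv (the column name `Neighborhood_enc` and the rank scheme are illustrative choices):

```python
import pandas as pd

# Toy stand-in: rank categories of an unordered attribute by mean SalePrice.
data_train = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B', 'C'],
    'SalePrice':    [100, 120, 300, 280, 200],
})

# Mean SalePrice per category, ordered from cheapest to most expensive
means = data_train.groupby('Neighborhood')['SalePrice'].mean().sort_values()
# Assign each category its rank in that ordering (1 = cheapest on average)
rank_map = {name: rank for rank, name in enumerate(means.index, start=1)}
data_train['Neighborhood_enc'] = data_train['Neighborhood'].map(rank_map)
print(data_train)
```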

Thank you for reading so far! I hope it was somehow useful.