PFA Housing Price Prediction

Harshitha Indumathi
10 min read · Jun 3, 2021

Introduction

Problem statement:

Housing is one of the basic needs of every person around the globe, which makes the housing and real estate market one of the major contributors to the world's economy. It is a very large market with many companies working in the domain. Data science is an important tool for solving problems in this domain: it helps companies increase their overall revenue and profits, improve their marketing strategies, and focus on changing trends in house sales and purchases. Predictive modelling, market mix modelling, and recommendation systems are some of the machine learning techniques used to achieve the business goals of housing companies. Our problem relates to one such housing company.

A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual values and flip them at a higher price. For the same purpose, the company has collected a data set from the sale of houses in Australia.

The company is looking at prospective properties to buy in order to enter the market. You are required to build a machine learning model to predict the actual value of the prospective properties and decide whether to invest in them or not.

1. Checking the missing values

Missing values in the dataset can be checked with the Python code below:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# df is the training DataFrame (the loading step is not shown in the article)
# columns with more than one missing value
missing_values = [x for x in df.columns if df[x].isnull().sum() > 1]
print('Number of missing variable columns:', len(missing_values))
print("Missing values in the dataset:\n", missing_values)
print("-" * 125)
df[missing_values].head()

2. Checking the percentage of missing values

The percentage of missing values per column can be checked using the following code:

for feature in missing_values:
    print(feature, np.round(df[feature].isnull().mean() * 100, 4), "% Missing Values")

Observation:

  1. There are many missing values present in the dataset.
  2. Hence we need to check their relationship with SalePrice.

Representation of missing values vs SalePrice
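The original plot can be reproduced with a small loop; the sketch below (an assumption about the original figure) marks each row as missing or not missing and compares the median SalePrice of the two groups:

for feature in missing_values:
    data = df.copy()
    # 1 where the value is missing, 0 otherwise
    data[feature] = np.where(data[feature].isnull(), 1, 0)
    # compare the median SalePrice of missing vs non-missing rows
    data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.title(feature)
    plt.ylabel('SalePrice')
    plt.show()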

3. Extracting all the numerical features

Extract all the numerical features using the following Python code:

numerical_features = [x for x in df.columns if df[x].dtypes != "O"]
print("The number of numerical columns in the dataset:", len(numerical_features))
print("Numerical columns in the dataset:\n", numerical_features)
print("-" * 125)
df[numerical_features].head()

Observation:

1. 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', and 'YrSold' are the date columns in this dataset.

2. From date-time columns we usually extract the number of days, years, hours, minutes, etc., so new features can be derived from these columns.

4. Extracting the year columns from the dataset

Extract the year columns from the dataset using the following Python code:

year_feature = [x for x in df.columns if 'Yr' in x or 'Year' in x]
print("The number of year columns in the dataset:", len(year_feature))
print("Year columns in the dataset:\n", year_feature)
print("-" * 125)
df[year_feature].head()

5. Checking the unique items in the date-time columns

Check the unique items in the date-time columns using the following Python code:

# checking the unique items in the date-time columns
for feature in year_feature:
    print("The unique items in the column", feature, ":\n", df[feature].unique())

Relationship between the year features and SalePrice

# the relationship between the year variables and SalePrice can be plotted as follows
for feature in year_feature:
    plt.figure(figsize=(8, 6))
    df.groupby(feature)['SalePrice'].median().plot()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.show()

Data Visualization:

Extract the discrete and continuous variables using the following Python code:

discrete_feature = [x for x in numerical_features if len(df[x].unique()) < 25 and x not in year_feature + ['Id']]
print("The number of discrete columns in the dataset:", len(discrete_feature))
print("Discrete columns in the dataset:\n", discrete_feature)
print("-" * 125)
df[discrete_feature].head()

Extracting the continuous variables:

continous_feature = [x for x in numerical_features if x not in discrete_feature + year_feature + ['Id']]
print("The number of continuous feature columns in the dataset:", len(continous_feature))
print("Continuous feature columns in the dataset:\n", continous_feature)
print("-" * 125)
df[continous_feature].head()

Observation:

1. There are 16 continuous feature columns in the dataset.

2. Most of the features are right-skewed.

3. Hence we need to apply a transformation, as sketched below.
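The right skew can be confirmed by plotting a histogram of each continuous feature; a minimal sketch (the bin count is an assumption):

for feature in continous_feature:
    df[feature].hist(bins=25)
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.title(feature)
    plt.show()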

The log transformation can be done using the following Python code:

for feature in continous_feature:
    data = df.copy()
    if 0 in data[feature].unique():
        pass
    else:
        data[feature] = np.log(data[feature])
        data['SalePrice'] = np.log(data['SalePrice'])
        plt.scatter(data[feature], data['SalePrice'])
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.title(feature)
        plt.show()

To check for outliers we use box plots:

for feature in continous_feature:
    data = df.copy()
    if 0 in data[feature].unique():
        pass
    else:
        data[feature] = np.log(data[feature])
        data.boxplot(feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()

Observation: there are a lot of outliers, so outlier treatment is required.
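The log transformation applied below already dampens many of these outliers; as an illustration of an explicit alternative (not the author's method), extreme values could be capped at the box-plot whiskers:

data = df.copy()
for feature in continous_feature:
    q1 = data[feature].quantile(0.25)
    q3 = data[feature].quantile(0.75)
    iqr = q3 - q1
    # clip values beyond the whiskers (1.5 * IQR from the quartiles)
    data[feature] = data[feature].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)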

Relation between categorical features and SalePrice
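As with the year features, this relationship can be visualized with a bar plot of the median SalePrice per category; a minimal sketch, where categorical_features (assumed here, and reused in the encoding step below) is the list of object-dtype columns:

categorical_features = [x for x in df.columns if df[x].dtypes == "O"]
for feature in categorical_features:
    df.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()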

In feature engineering, missing values in categorical features can be replaced with the word 'Missing' using the following Python code (features_nan, the list of categorical columns containing NaNs, was not defined in the article, so a definition is included):

# categorical columns that contain missing values
features_nan = [x for x in df.columns if df[x].isnull().sum() > 0 and df[x].dtypes == "O"]

def replace_cat_feature(df, features_nan):
    data = df.copy()
    data[features_nan] = data[features_nan].fillna('Missing')
    return data

df = replace_cat_feature(df, features_nan)
df[features_nan].isnull().sum()

Missing values present in numerical variables are replaced with the median, with a new indicator column flagging the rows that were missing, using the following Python code:

# numerical columns that contain missing values (not defined in the article)
numerical_with_nan = [x for x in df.columns if df[x].isnull().sum() > 0 and df[x].dtypes != "O"]

for feature in numerical_with_nan:
    median_value = df[feature].median()
    # flag the rows that were missing, then impute with the median
    df[feature + 'nan'] = np.where(df[feature].isnull(), 1, 0)
    df[feature].fillna(median_value, inplace=True)

df[numerical_with_nan].isnull().sum()

Extract new age features from the date-time variables using the following Python code:

for feature in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    # years elapsed between this date and the year the house was sold
    df[feature] = df['YrSold'] - df[feature]

Apply the log transformation to remove the right skew in the features 'LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', and 'SalePrice', where outliers are present:

num_features = ['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
for feature in num_features:
    df[feature] = np.log(df[feature])

Categorical encoding: after the skewness and outliers have been examined with box plots and reduced with the log transformation, we use LabelEncoder to convert the categorical features to numerical values:

from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
for feature in categorical_features:
    df[feature] = labelencoder.fit_transform(df[feature])

The missing data in the test data (df1) can be handled in the same way, using the following Python code:

# replace missing values in the categorical test columns with a new label
features_nan_test = [x for x in df1.columns if df1[x].isnull().sum() > 0 and df1[x].dtypes == "O"]

def replace_cat_feature_test(df1, features_nan_test):
    data = df1.copy()
    data[features_nan_test] = data[features_nan_test].fillna('Missing')
    return data

df1 = replace_cat_feature_test(df1, features_nan_test)
df1[features_nan_test].isnull().sum()

Missing values in the numerical columns of the test data are imputed with the median in the same way:

# numerical test columns that contain missing values (not defined in the article)
numerical_with_nan_test = [x for x in df1.columns if df1[x].isnull().sum() > 0 and df1[x].dtypes != "O"]

for feature in numerical_with_nan_test:
    median_value = df1[feature].median()
    df1[feature + 'nan'] = np.where(df1[feature].isnull(), 1, 0)
    df1[feature].fillna(median_value, inplace=True)

df1[numerical_with_nan_test].isnull().sum()

As in the train data, derive the age features from the date-time variables using the following Python code:

## date-time variables
for feature in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    df1[feature] = df1['YrSold'] - df1[feature]

The features 'LotFrontage', 'LotArea', '1stFlrSF', and 'GrLivArea' in the test data are also right-skewed, so the same log transformation is applied (their missing values were already imputed above):

num_features_test = ['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea']
for feature in num_features_test:
    df1[feature] = np.log(df1[feature])

Similarly, after removing the skewness, we use the LabelEncoder to convert the categorical test features to numerical values:

# categorical test columns (not defined in the article); ideally the encoder
# fitted on the training data would be reused here rather than refit
categorical_features_test = [x for x in df1.columns if df1[x].dtypes == "O"]
for feature in categorical_features_test:
    df1[feature] = labelencoder.fit_transform(df1[feature])

Feature Scaling:

We use the MinMaxScaler for scaling:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

We then separate the target (SalePrice) from the feature matrix using the following Python code:

y_train = df[['SalePrice']]
x = df.drop(['Id', 'SalePrice'], axis=1)
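The article does not show the scaler being applied or the split itself; a minimal sketch of one way to wire these steps together (the train_test_split parameters are assumptions):

from sklearn.model_selection import train_test_split

# scale every feature into the [0, 1] range
x_scaled = scaler.fit_transform(x)

# hold out a validation set for comparing the regression models below
x_tr, x_val, y_tr, y_val = train_test_split(x_scaled, y_train, test_size=0.2, random_state=42)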

Regression techniques used (a comparison sketch follows the list):

1. Linear Regression

2. Lasso Regression

3. Ridge Regression

4. Decision Tree Regression

5. Random Forest Regression
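A sketch of how the five models could be trained and compared on the held-out set (the hyperparameters and the RMSE metric are assumptions, not the article's exact setup):

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

models = {
    'Linear Regression': LinearRegression(),
    'Lasso Regression': Lasso(alpha=0.001),
    'Ridge Regression': Ridge(alpha=1.0),
    'Decision Tree Regression': DecisionTreeRegressor(random_state=42),
    'Random Forest Regression': RandomForestRegressor(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(x_tr, y_tr.values.ravel())
    preds = model.predict(x_val)
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    print(name, ': RMSE =', round(rmse, 2))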

Conclusion

The Lasso regression model is considered the best of the five because it has the lowest error (0.20), followed by Ridge regression (0.22).


Harshitha Indumathi

Harshitha completed her graduation in B.E. This is her first article on Medium.