
Predicting Home Sale Prices

In this project, I will be predicting the sale prices of homes using linear regression modeling. The housing market has boomed in recent years due to macroeconomic factors, a boom jump-started by a once-in-a-century pandemic event.

As with any market, the housing market's value is determined by the forces of supply and demand. Of course, it is far too difficult to predict the macroeconomic environment when evaluating the housing market, but we can use the features of individual buildings to predict their value and combine that information with economic data to value houses appropriately. This project's aim is to collect information on the features of individual houses and use it to build a model that predicts their sale prices.


The following are the Python libraries I utilized in this project:

pandas

matplotlib

seaborn

numpy

statsmodels (variance_inflation_factor)

scipy (linregress)

sklearn (train_test_split, LinearRegression, mean_squared_error, linear_model)
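
For reference, the corresponding imports might look like the following minimal sketch; the exact modules used in the original code are assumptions based on the list above.

# Core data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics and modeling
from scipy.stats import linregress
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error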


The dataset I have chosen to explore is a single CSV file from kaggle.com, published by Kaggle. The dataset is called “House Prices - Advanced Regression Techniques” and it is publicly accessible.


The dependent variable is the SalePrice of the homes.


The following are some of the independent variables that I will be using in the project:

MSSubClass (Identifies the type of dwelling involved in the sale.)

MSZoning (Identifies the general zoning classification of the sale.)

LotArea (Lot size in square feet)

BldgType (Type of dwelling)

HouseStyle (Style of dwelling)

OverallQual (Rates the overall material and finish of the house)

OverallCond (Rates the overall condition of the house)

YearBuilt (Original construction date)

TotalBsmtSF (Total square feet of basement area)

GrLivArea (Above grade (ground) living area square feet)



Regression Overview


Linear regression is a statistical method for finding the relationship between independent and dependent variables. It is a supervised modeling technique. One variable (y) is the dependent variable, the value you are trying to predict, and another variable (x) is the explanatory/independent variable. A model may use more than one predictor/independent variable, depending on the model one chooses.

In order to identify whether there is a linear relationship between x and y, we can draw a scatter plot and measure the strength of the association by calculating the correlation coefficient of the variables. The closer the correlation coefficient (r) is to -1, the more negatively correlated the variables are; the closer it is to +1, the more positively correlated they are; an r near 0 means there is little to no linear relationship. A least-squares (best-fit) regression line is then drawn on the scatterplot; this line represents our linear model. It is important for the variables to show some association for the model to be useful.
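
As a small illustration of these ideas (using made-up numbers, not the housing data), the correlation coefficient and the least-squares line can be computed with scipy's linregress:

import numpy as np
from scipy.stats import linregress

# Toy data with a roughly linear, positive relationship
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

result = linregress(x, y)
print(f"slope (m):                   {result.slope:.3f}")
print(f"intercept (b):               {result.intercept:.3f}")
print(f"correlation coefficient (r): {result.rvalue:.3f}")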


A linear regression line has the following equation:

y = mx + b

y = dependent variable

x = independent variable

m = slope

b = intercept



Experiment 1: Data Understanding

I looked at the columns in the dataframe and filtered down to 10 columns that I thought would be important. I dropped columns that had a lot of null values because I wanted to keep columns that contained real data values for each observation. I then created a scatterplot of each of the 10 columns against SalePrice (the y variable) and calculated the r value to check the correlation. MSSubClass, MSZoning, BldgType, and OverallCond had weak or negative correlations, so I dropped them. I also created a heat map to better visualize the correlations and decide which variables to use for the first experiment.
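
The exploration code is not reproduced in this post; a sketch of the step might look like the following, assuming the competition's train.csv and, for brevity, only the numeric candidate columns:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("train.csv")  # Kaggle "House Prices" training data

# Numeric candidate columns plus the target
cols = ["MSSubClass", "LotArea", "OverallQual", "OverallCond",
        "YearBuilt", "TotalBsmtSF", "GrLivArea", "SalePrice"]

# Scatterplot of each candidate feature against SalePrice
for col in cols[:-1]:
    df.plot.scatter(x=col, y="SalePrice", alpha=0.3, title=col)
    plt.show()

# Heat map of pairwise correlations to spot weak and strong relationships
sns.heatmap(df[cols].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()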

Experiment 1: Pre-processing

Out of the 10 columns, 3 were categorical variables, so I used a label encoder to convert them to numeric values. I also checked for possible NaN values, but there weren’t any present in the dataset.
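
A sketch of the encoding step, assuming the three categorical columns are MSZoning, BldgType, and HouseStyle from the list above and the df loaded earlier:

from sklearn.preprocessing import LabelEncoder

# Assumed categorical columns from the candidate list
categorical_cols = ["MSZoning", "BldgType", "HouseStyle"]

encoder = LabelEncoder()
for col in categorical_cols:
    df[col] = encoder.fit_transform(df[col])  # replace strings with integer labels

# Sanity check: count missing values in the columns of interest
print(df[categorical_cols + ["SalePrice"]].isna().sum())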


Experiment 1: Modeling

The x variables for the initial model are the columns remaining after the filtering above, and the y variable is SalePrice.


I then created my linear model, fit it to the training dataset, and used the model to predict values for the testing dataset.
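
The original code block is not reproduced here; a minimal sketch of the modeling step, assuming the six columns kept after the filtering in Experiment 1 (LotArea, HouseStyle, OverallQual, YearBuilt, TotalBsmtSF, GrLivArea) and an assumed 80/20 split, might look like this:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Assumed predictor set: the six columns kept after the filtering above
feature_cols = ["LotArea", "HouseStyle", "OverallQual",
                "YearBuilt", "TotalBsmtSF", "GrLivArea"]
X = df[feature_cols]
y = df["SalePrice"]

# Hold out part of the data for testing (split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)     # fit on the training set
y_pred = model.predict(X_test)  # predict prices for the test set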


Experiment 1: Evaluation

I calculated the root mean squared error (RMSE), which is a good measure of how well a model fits the data. The closer the RMSE is to 0, the better the model fits.
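
Continuing from the sketch above, the RMSE can be computed in one line (a sketch, not the original code):

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE: square root of the mean squared error between actual and predicted prices
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.1f}")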

I also calculated the variance inflation factor (VIF), which measures the amount of multicollinearity. Multicollinearity occurs when the independent variables are correlated with one another; this hurts the model because we want the x variables to be uncorrelated with each other so that each contributes useful information. A VIF value greater than 10 tells me that there is multicollinearity and that a few variables should be dropped.
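
The VIF computation might look like this sketch, using statsmodels' variance_inflation_factor on the training predictors from the modeling step above:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

# One VIF value per predictor; values above ~10 suggest problematic multicollinearity
vif = pd.DataFrame({
    "feature": X_train.columns,
    "VIF": [variance_inflation_factor(X_train.values, i)
            for i in range(X_train.shape[1])],
})
print(vif)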


Experiment 2

For Experiment 2, I dropped OverallQual, TotalBsmtSF, and GrLivArea and used the remaining x variables for the second model.
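
A sketch of the second model, continuing from the earlier sketches; the three remaining predictors (LotArea, HouseStyle, YearBuilt) are inferred from the columns kept and dropped above:

# Experiment 2: refit with the reduced predictor set
feature_cols_2 = ["LotArea", "HouseStyle", "YearBuilt"]  # inferred remaining predictors
X2 = df[feature_cols_2]

X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y, test_size=0.2, random_state=42)

model2 = LinearRegression().fit(X2_train, y2_train)
rmse2 = np.sqrt(mean_squared_error(y2_test, model2.predict(X2_test)))
print(f"Experiment 2 RMSE: {rmse2:.1f}")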

The RMSE for this experiment is much higher than in Experiment 1. I believe the reason is that I used 6 predictors for the first model and only 3 for the second. The VIF scores for this model, however, are very good.



Experiment 3

For the third experiment, I used the x variables that were most highly correlated with SalePrice.

To find the top 3 correlated x variables, I created another heatmap, ordered and formatted differently from the first one.
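
The reordered heatmap is not reproduced here, but the ranking it was used for can be sketched as follows (continuing from the earlier sketches):

# Rank the candidate columns by their correlation with SalePrice
order = df[cols].corr()["SalePrice"].sort_values(ascending=False).index
print(order[1:4])  # the top 3 predictors after SalePrice itself

# Re-draw the heatmap with rows/columns ordered by that ranking
sns.heatmap(df[cols].corr().loc[order, order], annot=True, fmt=".2f", cmap="coolwarm")
plt.show()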

The RMSE for this model is 40622.5, which is much better than the RMSE for model 2. The VIF values are high, however, which is a red flag. The x variables used here are the above-grade living area in square feet (GrLivArea), the overall material and finish of the house (OverallQual), and the basement area in square feet (TotalBsmtSF).



Conclusion

In this project I learned how to build linear regression models. I also found that the more predictors the model has, the better (lower) the RMSE value, and that using x variables that are more highly correlated with SalePrice also leads to a lower RMSE and, as a result, a better model.



References

https://towardsdatascience.com/linear-regression-simplified-ordinary-least-square-vs-gradient-descent-48145de2cf76

http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

https://medium.com/machine-learning-with-python/multiple-linear-regression-implementation-in-python-2de9b303fc0c



The code can be accessed by clicking here.


