Predicting housing prices with data analysis tools like Python has become popular with real estate investors. The concept is simple: take historical data, apply predictive models such as machine learning, and forecast future housing prices.
In practice, predicting house prices from property data means using statistical and machine learning techniques to analyze historical real estate transactions and other relevant factors, and then making predictions about future house prices.
As an example for this blog, let's take Amsterdam, one of the most popular and sought-after places to live in the Netherlands. The city's housing market is a fascinating one: its population has been growing steadily since the 1950s, it attracts expats and tourists alike, and it is home to many local businesses.
For real estate developers to make informed decisions about where to build, they need access to detailed and accurate data about the housing market in their area. One of our team members was browsing Kaggle and came across a housing price prediction problem that is part of its machine learning tutorial. We wanted to try something similar, but with a different data source and different features, so we scraped the data from a property website in the Netherlands using Python.
We collected the real estate data via web scraping over a period of 30 days and merged it into a single CSV file. The dataset consists of continuous and discrete variables, totaling 992 records with 27 features. Several machine learning models are then built to predict house prices from data about each house, such as its area, volume, number of rooms, number of floors, number of bathrooms, etc.
Data Exploration
Data exploration is a critical step in the machine learning process that helps to ensure the data used to train the model is of high quality and suitable for the task at hand. It includes importing the data, exploratory data analysis, data cleaning, feature engineering, data transformation, and splitting the data into training and test sets.
Importing libraries and dataset
The first step is to import the necessary libraries and the data into a format that can be easily accessed and analyzed. This typically involves reading the data into a Pandas DataFrame or a NumPy array.
Importing libraries:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
Importing dataset:
df_ams=pd.read_csv('Amsterdam.csv')
The dataset contains the following features:
1. URL: URL from which the housing data of Amsterdam was scraped
2. Title: Address of each house
3. Location: Location of the house
4. Price Per Month: Rent for the house per month
5. Area in Sq Metre: Area of the house in square metres
6. Number of rooms: Total number of rooms in the house
7. Interior: Details of the interior of the house, whether equipped with windows, carpeted floors, furniture, etc.
8. Description: Description of the house given for advertisement
9. Offered Since: The date from which the offering was made for the house
10. Availability: Details like if the house is immediately available or the date from which it will be available
11. Specification: Details like if it is accessible for disabled people, seniors, etc.
12. Upkeep Status: Status of maintenance of the house
13. Volume: Volume of the house
14. Type: Type of the place like house, apartment, studio, etc.
15. Construction Type: Whether new building or existing building
16. Construction Year: Constructed year of the building
17. Location Type: Location of the house specific to neighborhoods
18. Number of Bedrooms: Number of bedrooms in the house
19. Number of Bathrooms: Number of bathrooms in the house
20. Number of Floors: Number of floors in the house
21. Details of Balcony: Availability of balcony
22. Details of Garden: Details like availability, area, and location
23. Details of Storage: Availability of storage
24. Description of Storage: Details such as whether it is indoors, built of wood or stone, detached, etc.
25. Garage: Availability of garage
26. Contact Details: Contact details of the house owner
27. Timestamp: Time when details are recorded
Here’s a small snippet of the dataset:
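A quick look at the first few rows and the overall size of the data can also be taken directly from the dataframe (a minimal sketch of this inspection step):
#first few rows of the scraped data
df_ams.head()
#number of records and features
df_ams.shape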
The target variable here is 'PRICE PER MONTH', which represents the amount per month at which the property is rented.
Data Cleaning
Data cleaning involves identifying and correcting errors or inconsistencies in the data, such as missing values, duplicate records, or incorrect data types. This is an important step because it helps to ensure that the data is complete and accurate, which is necessary for building reliable models.
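Before correcting anything, it helps to get a quick overview of where the problems are (a small sketch; the exact counts depend on the scraped data):
#data type of each column
df_ams.dtypes
#missing values per column
df_ams.isna().sum()
#number of exact duplicate rows
df_ams.duplicated().sum()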
First of all, all the columns that are not likely to help in predicting the target variable are dropped from the data frame.
df_ams.drop(columns=['TITLE',
'File',
'URL',
'OFFERED SINCE',
'CONTACT DETAILS',
'TIMESTAMP',
'DESCRIPTION'],axis=1,inplace=True)
Missing values
It is important to mark missing data explicitly with NaN (Not a Number), because it makes the missing values easy to identify, count, and handle appropriately in later steps.
Here, the placeholder strings in the columns ‘Volume’, ‘Interior’, ‘Availability’, ‘Garage’, ‘Upkeep Status’, ‘Specification’, ‘Location Type’, ‘Number of Floors’, ‘Details of Garden’, ‘Details of Storage’, ‘Number of Bedrooms’, ‘Details of Balcony’, ‘Number of Bathrooms’, and ‘Description of Storage’ are replaced with NaN values.
df_ams['VOLUME'] = df_ams['VOLUME'].replace('Volume is not available',np.nan)
df_ams['INTERIOR'] = df_ams['INTERIOR'].replace('Interior is not available',np.nan)
df_ams['AVAILABILITY'] = df_ams['AVAILABILITY'].replace('Not available to book',np.nan)
df_ams['GARAGE'] = df_ams['GARAGE'].replace('Details of garage is not available',np.nan)
df_ams['UPKEEP STATUS'] = df_ams['UPKEEP STATUS'].replace('Upkeep is not available',np.nan)
df_ams['SPECIFICATION'] = df_ams['SPECIFICATION'].replace('Specifics are not available',np.nan)
df_ams['LOCATION TYPE'] = df_ams['LOCATION TYPE'].replace('Location type is not available',np.nan)
df_ams['NUMBER OF FLOORS'] = df_ams['NUMBER OF FLOORS'].replace('Number of floors is not available',np.nan)
df_ams['DETAILS OF GARDEN'] = df_ams['DETAILS OF GARDEN'].replace('Details of garden is not available',np.nan)
df_ams['DETAILS OF STORAGE'] = df_ams['DETAILS OF STORAGE'].replace('Details of storage is not available',np.nan)
df_ams['NUMBER OF BEDROOMS'] = df_ams['NUMBER OF BEDROOMS'].replace('Number of bedrooms is not available',np.nan)
df_ams['DETAILS OF BALCONY'] = df_ams['DETAILS OF BALCONY'].replace('Details of balcony is not available',np.nan)
df_ams['NUMBER OF BATHROOMS'] = df_ams['NUMBER OF BATHROOMS'].replace('Number of bathrooms is not available',np.nan)
df_ams['DESCRIPTION OF STORAGE'] = df_ams['DESCRIPTION OF STORAGE'].replace('Details of description of the storage is not available',np.nan)
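As a side note, the same per-column replacements could be written more compactly with a single regex-based replace. This is only a sketch, assuming every placeholder string follows the patterns above; it should be checked that no legitimate values match the pattern:
#any value containing 'is not available' or 'Not available to book' becomes NaN
df_ams = df_ams.replace(r'is not available|Not available to book', np.nan, regex=True)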
Now a list of columns with NaN values is created. The columns ‘Specification’, ‘Location Type’, ‘Description of Storage’, and ‘Availability’ turn out to have very little data available, so they are dropped from the data frame.
#list of columns that contain NaN values
df_col_null = df_ams.columns[df_ams.isna().any()==True].tolist()
#count of missing values in each of these columns
df_ams[df_col_null].isna().sum()
df_ams.drop(columns=['AVAILABILITY','SPECIFICATION','LOCATION TYPE','DESCRIPTION OF STORAGE'],axis=1,inplace=True)
Most of the features in the dataset have an object data type. Before converting the data types, the NaN values therefore need to be imputed with the mean, median, or mode, since a column cannot be converted to an integer type while it still contains NaN values.
#if VOLUME was read in as text, it may first need pd.to_numeric(df_ams['VOLUME'], errors='coerce')
df_ams['VOLUME'].fillna(df_ams['VOLUME'].mean(),inplace=True)
df_ams['GARAGE'].fillna(df_ams['GARAGE'].mode()[0],inplace=True)
df_ams['INTERIOR'].fillna(df_ams['INTERIOR'].mode()[0],inplace=True)
df_ams['UPKEEP STATUS'].fillna(df_ams['UPKEEP STATUS'].mode()[0],inplace=True)
df_ams['NUMBER OF FLOORS'].fillna(df_ams['NUMBER OF FLOORS'].mode()[0],inplace=True)
df_ams['DETAILS OF GARDEN'].fillna(df_ams['DETAILS OF GARDEN'].mode()[0],inplace=True)
df_ams['DETAILS OF BALCONY'].fillna(df_ams['DETAILS OF BALCONY'].mode()[0],inplace=True)
df_ams['DETAILS OF STORAGE'].fillna(df_ams['DETAILS OF STORAGE'].mode()[0],inplace=True)
df_ams['NUMBER OF BEDROOMS'].fillna(df_ams['NUMBER OF BEDROOMS'].mode()[0],inplace=True)
df_ams['NUMBER OF BATHROOMS'].fillna(df_ams['NUMBER OF BATHROOMS'].mode()[0],inplace=True)
Converting the datatype of columns with integers:
df_ams['NUMBER OF BEDROOMS'] = df_ams['NUMBER OF BEDROOMS'].astype(int)
df_ams['NUMBER OF BATHROOMS'] = df_ams['NUMBER OF BATHROOMS'].astype(int)
df_ams['NUMBER OF FLOORS'] = df_ams['NUMBER OF FLOORS'].astype(int)
Feature Engineering
Feature engineering involves creating and selecting features from the raw data that will be used as input to the model. This can involve creating new features by combining or transforming existing ones, selecting a subset of relevant features, or engineering features to better suit the model's needs.
Here, two features, ‘Details of Garden’ and ‘Location’, are taken into consideration. From the column ‘Details of Garden’, it is possible to extract the availability of a garden, its area, and its location as separate features. Similarly, from the column ‘Location’, it is possible to extract a new feature resembling a postal (pin) code.
Details of Garden:
# splitting the column with '(' into 2 different columns
df_ams[["AVAILABILITY OF GARDEN",'AREA OF GARDEN']] = df_ams["DETAILS OF GARDEN"].str.split("(", expand = True)
#removing unnecessary characters from data
df_ams['AVAILABILITY OF GARDEN']=df_ams['AVAILABILITY OF GARDEN'].replace(['Present '],'Present')
#splitting the column 'Area of Garden' with 'm²' into 2 different columns of area and location
df_ams[['AREA OF GARDEN IN SQ METRE','GARDEN LOCATION']] = df_ams["AREA OF GARDEN"].str.split("m²", expand = True)
#removing extra characters and filling missing data with NaN
df_ams['GARDEN LOCATION'] = df_ams['GARDEN LOCATION'].str[1:-1]
df_ams['GARDEN LOCATION'] = df_ams['GARDEN LOCATION'].replace('',np.nan)
#dropping the extra columns
df_ams.drop(columns=['AREA OF GARDEN','DETAILS OF GARDEN'],axis=1,inplace=True)
Here, the feature ‘Details of Garden’ is first split into two columns, ‘Availability of Garden’ and ‘Area of Garden’, using the character ‘(‘ in the string, and then unnecessary characters are removed from the new features.
The feature ‘Availability of Garden’ now indicates whether a property has a garden, and the feature ‘Area of Garden’ contains the area of the garden in square metres along with the location of the garden. To separate these, the column ‘Area of Garden’ is split again into ‘Area of Garden in Sq Metre’ and ‘Garden Location’. Finally, extra characters are removed, missing values are filled with NaN, and the redundant columns are dropped from the dataframe.
Location:
#splitting the column location with ' ' into 5 different columns
df_ams[['LOCATION PIN','A','B','C','D']] = df_ams["LOCATION"].str.split(" ", expand = True)
#dropping extra columns
df_ams.drop(columns=['A','B','C','D','LOCATION'],axis=1,inplace=True)
The column ‘Location’ is split on the space character into ‘Location Pin’ and four other columns, and the extra columns are then dropped from the data frame, keeping only ‘Location Pin’.
#columns with null values
df_col_null = df_ams.columns[df_ams.isna().any()==True].tolist()
df_ams[df_col_null].isna().sum()
#drop columns with less data available
df_ams.drop(columns=['AREA OF GARDEN IN SQ METRE','GARDEN LOCATION'],axis=1,inplace=True)
df_ams['AVAILABILITY OF GARDEN'].fillna(df_ams['AVAILABILITY OF GARDEN'].mode()[0],inplace=True)
Data Transformation
Data transformation involves applying various operations to the data to prepare it for modeling. This can include scaling the data, imputing missing values, and applying one-hot encoding and label encoding to categorical variables.
Label encoding:
Label encoding is a method of preprocessing data in machine learning, which involves assigning a unique numerical value to each category in a categorical variable.
le = LabelEncoder()
df_ams['INTERIOR']=le.fit_transform(df_ams['INTERIOR'])
df_ams['TYPE']=le.fit_transform(df_ams['TYPE'])
df_ams['UPKEEP STATUS']=le.fit_transform(df_ams['UPKEEP STATUS'])
Here the categorical features containing more than 2 categories are encoded using label encoding.
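To check which number was assigned to which category, the classes_ attribute of the fitted encoder can be inspected. Note that le was re-fitted for each column above, so at this point it holds the mapping for the last one, ‘UPKEEP STATUS’ (a small sketch):
#categories of the last fitted column and their encoded values
dict(zip(le.classes_, le.transform(le.classes_)))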
One hot encoding:
One hot encoding involves converting each categorical value into a new column and assigning a binary value of 1 or 0 to indicate the presence or absence of the categorical value in the original data.
#dataframe with columns for which one hot encoding is needed
dum_cols=df_ams[['DETAILS OF BALCONY','GARAGE','DETAILS OF STORAGE','CONSTRUCTION TYPE','AVAILABILITY OF GARDEN']]
#one-hot encoding
dum_ams=pd.get_dummies(dum_cols)
#dropping extra columns and renaming the rest encoded columns
dum_ams.drop(columns=['DETAILS OF BALCONY_Not present',
'GARAGE_No',
'DETAILS OF STORAGE_Not present',
'CONSTRUCTION TYPE_Existing building',
'AVAILABILITY OF GARDEN_Not present']
,axis=1,inplace=True)
dum_ams.rename(columns={
'DETAILS OF BALCONY_Present':'AVAILABILITY OF BALCONY',
'GARAGE_Yes':'AVAILABILITY OF GARAGE',
'DETAILS OF STORAGE_Present':'AVAILABILITY OF STORAGE',
'CONSTRUCTION TYPE_New development':'NEW BUILDING',
'AVAILABILITY OF GARDEN_Present':'AVAILABILITY OF GARDEN'
},inplace=True)
#dropped extra columns
df_ams.drop(columns=['DETAILS OF BALCONY','GARAGE','DETAILS OF STORAGE','CONSTRUCTION TYPE','AVAILABILITY OF GARDEN'],axis=1,inplace=True)
#concatenating encoded dataframe with original dataframe
df_ams=pd.concat((df_ams,dum_ams),axis=1)
The columns that need one-hot encoding are collected into a separate dataframe and encoded using get_dummies. The redundant columns are then dropped, the remaining encoded columns are renamed, and the result is concatenated with the original dataframe.
Feature Selection:
A heatmap is a useful tool for selecting features for a machine learning model. Visualizing the correlations as a heatmap makes it easy to identify which independent features are most strongly correlated with the target feature, and this information can then inform the modeling effort.
To use a heatmap for this purpose, we first compute a correlation matrix by calculating the correlation coefficient between each pair of features, and then visualize this matrix as a heatmap. Values near 1 indicate a strong positive correlation, values near -1 a strong negative correlation, and values near 0 little or no correlation, so the features most strongly related to the target stand out by their color.
plt.figure(figsize=(10,5))
#numeric_only=True restricts the correlation to numeric columns (e.g. it skips LOCATION PIN)
sns.heatmap(df_ams.corr(numeric_only=True),annot=True)
plt.show()
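The same correlations can also be read off as numbers, which makes it easy to rank the candidate predictors (a minimal sketch on the same dataframe):
#correlation of every numeric feature with the target, strongest first
df_ams.corr(numeric_only=True)['PRICE PER MONTH'].sort_values(ascending=False)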
From the heatmap, it is clear that the features ‘Area in Sq Metre’, ‘Number of Rooms’, ‘Volume’, ‘Number of Bedrooms’, ‘Number of Bathrooms’, and ‘Number of Floors’ correlate well with the target feature ‘Price Per Month’, which makes them good candidates as predictors.
There is also a high correlation between ‘Number of Bedrooms’ and ‘Number of Rooms’, so ‘Number of Bedrooms’ is dropped along with the other weakly correlated features.
df_ams.drop(columns=['TYPE',
'AVAILABILITY OF BALCONY',
'AVAILABILITY OF GARAGE',
'INTERIOR',
'NUMBER OF BEDROOMS',
'AVAILABILITY OF STORAGE',
'LOCATION PIN',
'AVAILABILITY OF GARDEN',
'NEW BUILDING',
'UPKEEP STATUS',
'CONSTRUCTION YEAR'],axis=1,inplace=True)
Let’s visualize the relationship between the predictor features vs target feature:
#plot each predictor against the target; X is only defined in the next section, so the columns are taken from df_ams
sns.pairplot(df_ams,
x_vars=df_ams.columns.drop('PRICE PER MONTH'),
y_vars='PRICE PER MONTH',
diag_kind='hist',
kind='reg')
All the predictor features seem to have a positive correlation with price.
Finally, the data is checked for duplicate rows, and any duplicates are removed.
df_ams.drop_duplicates(inplace=True)
Data Split and Modeling
X=df_ams.drop(['PRICE PER MONTH'],axis=1)
y=df_ams['PRICE PER MONTH']
The X array represents the features of the data, i.e., the input variables that will be used to make predictions. The y array represents the target variable, i.e., the value we are trying to predict: the price per month.
Detecting outliers:
For detecting outliers in the independent features, a function is defined that calculates the first (25th percentile) and third (75th percentile) quartiles of the data and the interquartile range (IQR), which is the difference between them. The IQR is a measure of dispersion; values more than 1.5 × IQR below the first quartile or above the third quartile are flagged as outliers.
def detect_outliers(x):
    #first and third quartiles of each column
    quartile_1 = x.quantile(0.25)
    quartile_3 = x.quantile(0.75)
    #interquartile range and the usual 1.5*IQR bounds
    iqr = quartile_3 - quartile_1
    lower_bound = quartile_1 - (iqr * 1.5)
    upper_bound = quartile_3 + (iqr * 1.5)
    return (x > upper_bound) | (x < lower_bound)
Then the outliers in the independent features are found and masked out. Outliers turn out to be present in ‘Volume’; the masking turns them into null values, which are then filled with the median of the column.
#boolean mask of outlier cells
outliers = detect_outliers(X)
#masking replaces outlier cells with NaN while keeping the rows, so X stays aligned with y
X = X[~outliers]
#count and fill the resulting NaN values in 'VOLUME' with the median
X['VOLUME'].isna().sum()
X['VOLUME'].fillna(X['VOLUME'].median(),inplace=True)
Split the dataset into a training set and a test set:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
The X and y arrays represent the features and labels. The test_size parameter specifies the proportion of the data that should be allocated to the test set (in this case, 20%). The random_state parameter controls the random sampling of the data.
Scaling:
Scaling the data is often a necessary step in machine learning because it can help to ensure that all of the features are on the same scale, which can improve the performance of some models.
#fit the scaler on the training data only, then transform both sets
scaler=StandardScaler()
scaler.fit(X_train)
X_train_scaled=scaler.transform(X_train)
X_test_scaled=scaler.transform(X_test)
After the data is transformed, the X_train_scaled and X_test_scaled arrays can be used as input to a model instead of the original X_train and X_test arrays.
Modeling:
Modeling is the process of building a statistical or machine-learning model to make predictions or decisions based on data.
Lasso Regression:
The Lasso class from scikit-learn is used to create a lasso regression model, and the GridSearchCV function is used to tune its hyperparameters. GridSearchCV performs an exhaustive search over a specified parameter grid, evaluating the model for each combination of hyperparameters using cross-validation.
In this case, the hyperparameter being tuned is alpha, which is the regularization strength of the Lasso model. The values of alpha being searched are [0.01, 0.1, 1, 10, 100].
After the GridSearchCV object is fit to the training data, the best hyperparameters (in this case, the best value of alpha) are stored in the best_params_ attribute of the grid_search object.
Then, the model is used to make predictions on the test set using the predict function, and the mean squared error (MSE) between the true labels and the predicted labels is calculated using the mean_squared_error function from scikit-learn. Finally, the test score (i.e. the coefficient of determination, R^2) is also calculated and printed.
model = Lasso()
# Create a dictionary of hyperparameters to search
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
# Use GridSearchCV to tune the hyperparameters
grid_search = GridSearchCV(model, param_grid, cv=5)
#fitting the lasso regression model
grid_search.fit(X_train_scaled, y_train)
#predicting the price
y_pred = grid_search.predict(X_test_scaled)
# Print the best hyperparameters
print("Best hyperparameters:", grid_search.best_params_)
# Evaluate the model on the test set
print("Mean Squared Error: %.3f" % mean_squared_error(y_test, y_pred))
print("Test score: %.4f" % (r2_score(y_test,y_pred)))
Ridge Regression:
In Ridge Regression, similar to the Lasso Regression model, the GridSearchCV function from the sklearn library performs a grid search over the specified hyperparameter space to find the best hyperparameters for a Ridge regression model.
The Ridge regression model and the hyperparameters to search over are passed to GridSearchCV as arguments. The cv parameter specifies the number of cross-validation folds to use in the grid search. The best_params_ attribute of the grid_search object is used to print out the best hyperparameters found by the grid search.
model = Ridge()
# Create a dictionary of hyperparameters to search
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
# Use GridSearchCV to tune the hyperparameters
grid_search = GridSearchCV(model, param_grid, cv=5)
#fitting the model for the tuned parameters
grid_search.fit(X_train_scaled, y_train)
#predicting for the price
y_pred = grid_search.predict(X_test_scaled)
# Print the best hyperparameters
print("Best hyperparameters:", grid_search.best_params_)
# Evaluate the model on the test set
print("Mean Squared Error: %.3f" % mean_squared_error(y_test, y_pred))
print("Test Score: %.4f" %grid_search.score(X_test_scaled,y_test))
Random Forest:
Here the RandomizedSearchCV function from the sklearn library is used to perform a randomized search over the specified hyperparameter space to find the best hyperparameters for a random forest model.
The random forest model and the hyperparameter search space are passed to RandomizedSearchCV as arguments. The n_iter parameter specifies the number of random combinations of hyperparameters to try. The cv parameter specifies the number of cross-validation folds to use in the search. The return_train_score parameter specifies whether to include train scores in the results.
model = RandomForestRegressor()
# Define the hyperparameter search space
param_distributions = {'n_estimators': [10, 100, 1000],
'max_depth': [None, 3, 5, 7],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]}
# Use RandomizedSearchCV to tune the hyperparameters
random_search = RandomizedSearchCV(model, param_distributions, n_iter=10, cv=5, return_train_score=True)
#fitting the model with the tuned parameters
random_search.fit(X_train_scaled, y_train)
#predicting for the price
y_pred = random_search.predict(X_test_scaled)
# Print the best hyperparameters
print("Best hyperparameters:", random_search.best_params_)
# Evaluate the model
print("Mean Squared Error: %.3f" % mean_squared_error(y_test, y_pred))
print("Test Score: %.3f" %random_search.score(X_test_scaled,y_test))
XGBoost:
XGBoost works by constructing a set of decision trees and combining their predictions through a process called boosting. This model uses the GridSearchCV function from the sklearn library to perform a grid search over the specified hyperparameter space to find the best hyperparameters for an XGBoost model.
The XGBoost model and the hyperparameter grid are passed to GridSearchCV as arguments. The cv parameter specifies the number of cross-validation folds to use in the grid search. The n_jobs parameter specifies the number of CPUs to use for the computation (-1 means using all available CPUs). The verbose parameter specifies the verbosity level.
model = XGBRegressor()
# Define the hyperparameter grid
param_grid = {'max_depth': [3, 5, 7],
'learning_rate': [0.1, 0.2, 0.3],
'n_estimators': [100, 200, 300],
'reg_lambda': [0.1, 1.0, 10.0]}
# Use grid search to tune the hyperparameters
grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1, verbose=1)
#fitting the model
grid_search.fit(X_train_scaled, y_train)
#prediction done for the price
y_pred = grid_search.predict(X_test_scaled)
# Print the best hyperparameters
print(grid_search.best_params_)
# Evaluate the model
print("Mean Squared Error: %.3f" % mean_squared_error(y_test, y_pred))
print("Test Score: %.3f" %grid_search.score(X_test_scaled,y_test))
Conclusion
In this post, we have demonstrated how to use machine learning techniques to predict house prices in Amsterdam. We started by exploring the data and identifying potential features that could influence prices. We then used these features to train a Lasso regression model, a Ridge regression model, a random forest model, and an XGBoost model, and applied hyperparameter tuning to improve their performance.
It was found that all of the models performed well, with the XGBoost model having the best test set performance. However, it is important to note that the results of this analysis are dependent on the choice of features and the specific data used. Different data or a different set of features could lead to different conclusions.
Looking to acquire data sets for similar analysis? Contact Datahut to learn more.