House Prices Prediction with Regression Modelling and Features Selection

Introduction
Data Wrangling
Model

Introduction

Language used: Python

Goal: Prediction of house prices.

Data used: Dataset from Properati website (https://www.properati.com.ar/). Two datasets: one for training (dataframe: dfef) and one for testing (dataframe: dfp)

Link to the Dataset: https://www.kaggle.com/datasets/jluza92/argentina-properati-listings-dataset-20202021/data (1gb)

Libraries used

Library	Description
pandas	Data manipulation and analysis to work with structured data
numpy	For numerical operations to work on large and multi-dimensional arrays and matrices
sklearn	used for machine learning algorithms for classification, regression, clustering, dimensionality reduction
matplotlib.pyplot	Plotting library to create visualizations
seaborn	Statistical data visualization tool for informative statistical graphics

Variables:

Variable	Description
start_date	start date of the ad (2019-2020)
end_date	end date of the ad (2019-2020)
created_on	creation of the ad (2019-2020)
lat	latitude
lon	longitude
l1	country where the house is located
l2	region where the house is located
l3	neighbourhood where the house is located
l4	area of the neighbourhood where the house is located
rooms	Number of rooms
bedrooms	Number of bedrooms
bathrooms	Number of bathrooms
surface_total	total surface of the house
surface_covered	surface of the house (excluding balcony)
price	price
currency	price currency
title	title of the ad
description	description of the property
property_type	type of housing offered

Data wrangling

Values to be predicted are in dfp dataframe.

Summary of numeric variables values in the testing dataframe (dfp)

table

The focus is on properties in Capital Federal (Buenos Aires) and the southern area (GBA Zona Sur), specifically flats and penthouses for sale in USD.

In the training dataframe (dfef), only properties meeting the criteria in dfp and having a non-null price are considered. Irrelevant columns, such as empty or single-valued ones, are dropped.

In both dataframes, unique neighborhood names are identified. We then search for these names in the “title” and “description” columns. If found and the “Neighbourhood” column is empty, it is filled with the discovered value.

To handle remaining missing values in the “Neighbourhood” (l4) column, a KNN classifier is applied based on longitude and latitude.

code_1

To address missing values in the columns the following is used:

bedrooms: the number of rooms minus 1 (deducting the bathroom).
bathrooms: the number of rooms minus the bedrooms.
total_surface: same as the value covered_surface, if available.
covered_surface: same as the value total_surface, if available.

Following this, KNN Imputer is utilized to handle missing values in latitude, longitude, covered_surface, and total_surface.

With no missing values in longitude and latitude, the nan values for Neighborhoods are imputed with KNN classifier as with longitude and latitude the neighbourhood can be identified.

Additionally, an array is constructed with keywords associated with luxurious apartments.

palabras_clave = ['parrilla','balcon', 'patio','pileta', 'piletas', 'piscinas', 'gym', 'piscina', 'seguridad', 'subte', 'metrobus', 'terraza', 'jardin']

These words are searched in the ’title’ and ‘description’ columns. If they are present, in the new ’luxury’ column is assigned a value of 1; otherwise, it receives a 0.

Each neighborhood corresponds to a ‘Comuna’. Using another array, a new ‘Comuna’ column is created to associate each neighborhood with the correct Comuna.

A ‘median_price’ column is added by grouping the columns ’neighbourhood’, ‘rooms’, ‘bedrooms’, ‘bathrooms’ and ‘property_type’ and calculating the median price value for each group. Missing values are addressed by employing a KNN Classifier for imputation.

Model

Dataframes are filtered to contain only numeric columns. Categorical columns are converted to dummies.

Features Selection

For regression models, the assumption is that the variables are indepdent, that means not dependent on others. To check this, correlation is tested and only variables with p-values > 0.60 are included.

Importance Score

graph_2

The most important variables chosen by the model (lon, lat, surface_covered, median_price) in addition to variable “luxury” (as better results are obtained), are used for the prediction.

After trying different models and comparing the related mean squared error, it is considered that the ensamble Random Forest Regressor model provides the best accuracy. It is then applied to the training dataset and then to the testing dataset to get the predictions.

Table of Contents#

Introduction#

Data wrangling#

Model#

Table of Contents

Introduction

Data wrangling

Model