Table of Contents

  1. Introduction
  2. Data Wrangling
  3. Model

Introduction

Language used: Python

Goal: Prediction of house prices.

Data used: dataset from the Properati website (https://www.properati.com.ar/). Two datasets are used: one for training (dataframe: dfef) and one for testing (dataframe: dfp).

Link to the Dataset: https://www.kaggle.com/datasets/jluza92/argentina-properati-listings-dataset-20202021/data (1 GB)

Libraries used

  • pandas: data manipulation and analysis for structured data
  • numpy: numerical operations on large, multi-dimensional arrays and matrices
  • sklearn: machine learning algorithms for classification, regression, clustering and dimensionality reduction
  • matplotlib.pyplot: plotting library used to create visualizations
  • seaborn: statistical data visualization library for informative statistical graphics

Variables:

  • start_date: start date of the ad (2019-2020)
  • end_date: end date of the ad (2019-2020)
  • created_on: creation date of the ad (2019-2020)
  • lat: latitude
  • lon: longitude
  • l1: country where the house is located
  • l2: region where the house is located
  • l3: neighbourhood where the house is located
  • l4: area of the neighbourhood where the house is located
  • rooms: number of rooms
  • bedrooms: number of bedrooms
  • bathrooms: number of bathrooms
  • surface_total: total surface of the house
  • surface_covered: covered surface of the house (excluding balcony)
  • price: price
  • currency: price currency
  • title: title of the ad
  • description: description of the property
  • property_type: type of housing offered

Data Wrangling

The values to be predicted are contained in the dfp dataframe.

Summary of the numeric variables in the testing dataframe (dfp):

table
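
A summary like the one above can be produced with pandas; a minimal sketch, assuming dfp is loaded from a CSV file (the file path is illustrative, not the project's actual file name):

    import pandas as pd

    # Load the testing dataframe (illustrative path)
    dfp = pd.read_csv("properati_test.csv")

    # describe() reports count, mean, std, min, quartiles and max for the numeric columns
    print(dfp.describe())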

The focus is on properties in Capital Federal (Buenos Aires) and the southern area (GBA Zona Sur), specifically flats and penthouses for sale in USD.

In the training dataframe (dfef), only properties meeting the criteria in dfp and having a non-null price are considered. Irrelevant columns, such as empty or single-valued ones, are dropped.
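
A minimal sketch of how this filtering could look; the file path and the exact category labels ("Capital Federal", "Bs.As. G.B.A. Zona Sur", "Departamento", "PH") are assumptions about how the dataset encodes these values:

    import pandas as pd

    dfef = pd.read_csv("properati_train.csv")  # illustrative path

    regions = ["Capital Federal", "Bs.As. G.B.A. Zona Sur"]
    types = ["Departamento", "PH"]  # flats and penthouses

    dfef = dfef[
        dfef["l2"].isin(regions)
        & dfef["property_type"].isin(types)
        & (dfef["currency"] == "USD")
        & dfef["price"].notna()
    ]

    # Drop columns that carry no information: completely empty or single-valued ones
    empty_cols = [c for c in dfef.columns if dfef[c].isna().all()]
    constant_cols = [c for c in dfef.columns if dfef[c].nunique(dropna=True) <= 1]
    dfef = dfef.drop(columns=list(set(empty_cols + constant_cols)))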

In both dataframes, unique neighborhood names are identified. We then search for these names in the “title” and “description” columns. If found and the “Neighbourhood” column is empty, it is filled with the discovered value.
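
A minimal sketch of this text search, assuming the neighbourhood column is l4 (as referenced below; l3 could be used instead) and that the unique names have already been collected:

    def fill_neighbourhood_from_text(df, neighbourhoods, col="l4"):
        """Fill missing neighbourhood values with the first known name found
        in the ad's title or description."""
        text = (df["title"].fillna("") + " " + df["description"].fillna("")).str.lower()
        for name in neighbourhoods:
            mask = df[col].isna() & text.str.contains(name.lower(), regex=False)
            df.loc[mask, col] = name
        return df

    # Applied to both dataframes with the neighbourhood names found in each
    dfef = fill_neighbourhood_from_text(dfef, dfef["l4"].dropna().unique())
    dfp = fill_neighbourhood_from_text(dfp, dfp["l4"].dropna().unique())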

To handle remaining missing values in the “Neighbourhood” (l4) column, a KNN classifier is applied based on longitude and latitude.

[code: KNN classifier imputing the neighbourhood (l4) column from latitude and longitude]
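
A minimal sketch of this step, assuming the neighbourhood column is l4 and using n_neighbors=5 (the actual value of k is not stated in the text):

    from sklearn.neighbors import KNeighborsClassifier

    has_coords = dfef["lat"].notna() & dfef["lon"].notna()
    known = dfef[has_coords & dfef["l4"].notna()]
    missing = dfef[has_coords & dfef["l4"].isna()]

    if not missing.empty:
        knn = KNeighborsClassifier(n_neighbors=5)
        knn.fit(known[["lat", "lon"]], known["l4"])
        dfef.loc[missing.index, "l4"] = knn.predict(missing[["lat", "lon"]])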

To address missing values in the remaining columns, the following rules are used (see the sketch after this list):

  • bedrooms: the number of rooms minus 1 (deducting the bathroom).
  • bathrooms: the number of rooms minus the number of bedrooms.
  • surface_total: the value of surface_covered, if available.
  • surface_covered: the value of surface_total, if available.
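
A minimal sketch of these rules with pandas fillna, applied to the training dataframe (the same logic applies to dfp):

    # Fill missing room counts and surfaces with the rules listed above
    dfef["bedrooms"] = dfef["bedrooms"].fillna(dfef["rooms"] - 1)
    dfef["bathrooms"] = dfef["bathrooms"].fillna(dfef["rooms"] - dfef["bedrooms"])
    dfef["surface_total"] = dfef["surface_total"].fillna(dfef["surface_covered"])
    dfef["surface_covered"] = dfef["surface_covered"].fillna(dfef["surface_total"])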

Following this, a KNN Imputer is used to handle the missing values in lat, lon, surface_covered and surface_total.
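
A minimal sketch of the KNN imputation, with n_neighbors=5 as an assumed parameter:

    from sklearn.impute import KNNImputer

    cols = ["lat", "lon", "surface_covered", "surface_total"]
    imputer = KNNImputer(n_neighbors=5)
    dfef[cols] = imputer.fit_transform(dfef[cols])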

With no missing values left in longitude and latitude, the remaining NaN values in the neighbourhood column are imputed with the KNN classifier, since the neighbourhood can be identified from the coordinates.

Additionally, an array is constructed with keywords associated with luxurious apartments.

palabras_clave = ['parrilla','balcon', 'patio','pileta', 'piletas', 'piscinas', 'gym', 'piscina', 'seguridad', 'subte', 'metrobus', 'terraza', 'jardin']

These words are searched for in the ‘title’ and ‘description’ columns. If any of them is present, the new ‘luxury’ column is assigned a value of 1; otherwise, it receives a 0.
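
A minimal sketch of how the ‘luxury’ flag could be built from these keywords:

    # Combine title and description into one lower-case text field
    text = (dfef["title"].fillna("") + " " + dfef["description"].fillna("")).str.lower()

    # 1 if any keyword appears in the ad's text, 0 otherwise
    dfef["luxury"] = text.str.contains("|".join(palabras_clave), regex=True).astype(int)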

Each neighbourhood corresponds to a ‘Comuna’. Using another array, a new ‘Comuna’ column is created to associate each neighbourhood with the correct Comuna.
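
A minimal sketch with a deliberately truncated mapping (the real mapping covers every Capital Federal neighbourhood; neighbourhoods outside the city, e.g. in GBA Zona Sur, would need their own handling):

    # Illustrative subset of the neighbourhood-to-Comuna mapping
    comunas = {
        "Retiro": 1, "Recoleta": 2, "Balvanera": 3, "Caballito": 6,
        "Villa Urquiza": 12, "Belgrano": 13, "Palermo": 14,
    }

    # Neighbourhoods missing from the mapping are left as NaN
    dfef["Comuna"] = dfef["l4"].map(comunas)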

A ‘median_price’ column is added by grouping by ‘neighbourhood’, ‘rooms’, ‘bedrooms’, ‘bathrooms’ and ‘property_type’ and calculating the median price for each group. Remaining missing values are imputed with a KNN classifier.
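
A minimal sketch of this grouping, computing the medians on the training data and attaching them to both dataframes; groups that never appear in dfef are left as NaN, which is what the KNN-based imputation then fills:

    group_cols = ["l4", "rooms", "bedrooms", "bathrooms", "property_type"]

    medians = (dfef.groupby(group_cols)["price"]
                   .median()
                   .rename("median_price")
                   .reset_index())

    dfef = dfef.merge(medians, on=group_cols, how="left")
    dfp = dfp.merge(medians, on=group_cols, how="left")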

Model

Dataframes are filtered to contain only numeric columns. Categorical columns are converted to dummies.
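
A minimal sketch of this preparation; dropping the free-text columns before encoding is an assumption, since one-hot encoding ‘title’ and ‘description’ directly would not be meaningful:

    import pandas as pd

    # Free-text columns are dropped rather than one-hot encoded
    dfef = dfef.drop(columns=["title", "description"], errors="ignore")

    # Convert the remaining categorical (object) columns to dummy variables
    cat_cols = list(dfef.select_dtypes(include="object").columns)
    dfef = pd.get_dummies(dfef, columns=cat_cols, drop_first=True, dtype=int)

    # Keep only numeric columns (drops anything left over, e.g. dates)
    dfef = dfef.select_dtypes(include="number")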

Feature Selection

For regression models, the assumption is that the explanatory variables are independent, i.e. not dependent on one another. To check this, correlations are examined and only variables with a correlation above 0.60 are included.
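
One way to read this check, sketched below: rank the features by their absolute correlation with the target price and keep those above the 0.60 threshold (the exact criterion used in the original notebook may differ):

    # Absolute correlation of every numeric feature with the target
    corr_with_price = dfef.corr()["price"].abs().sort_values(ascending=False)

    # Features above the 0.60 threshold, excluding the target itself
    selected = corr_with_price[corr_with_price > 0.60].index.drop("price")
    print(list(selected))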

Importance Score

[graph: feature importance scores]

The most important variables identified by the model (lon, lat, surface_covered, median_price), together with the ‘luxury’ variable (which yields better results), are used for the prediction.
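
Importance scores like the ones shown above can be read from a fitted tree ensemble; a minimal sketch using a random forest (the model and hyperparameters used to produce the graph are assumptions):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # Assumes all missing values were imputed in the wrangling steps above
    X = dfef.drop(columns=["price"])
    y = dfef["price"]

    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X, y)

    importances = pd.Series(rf.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(10))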

After trying different models and comparing their mean squared errors, the ensemble Random Forest Regressor model is found to provide the best accuracy. It is fitted on the training dataset and then applied to the testing dataset to obtain the predictions.
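
A minimal sketch of this final step, holding out part of the training data to compare the mean squared error and then predicting on dfp (the hyperparameters and the validation split are assumptions; dfp must contain the same engineered columns as dfef):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    features = ["lon", "lat", "surface_covered", "median_price", "luxury"]

    X_train, X_val, y_train, y_val = train_test_split(
        dfef[features], dfef["price"], test_size=0.2, random_state=0)

    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X_train, y_train)
    print("Validation MSE:", mean_squared_error(y_val, rf.predict(X_val)))

    # Predictions for the testing dataframe
    predictions = rf.predict(dfp[features])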