Ilya Tsakunov, David Chudán
Use of Data Mining for Analysis of Czech Real Estate Market
Issue: 2/2023
Journal: Acta Informatica Pragensia
DOI: 10.18267/j.aip.215
Keywords: Data mining; Web scraping; Real estate market; Exploratory analysis
Abstract:
This paper analyses data from the Czech real estate market. The data were scraped from the bezrealitky.cz portal and cover both sales and rentals; in total, 3546 records with 54 attributes were obtained. Exploratory data analysis provided a basic overview of the data and identified basic characteristics such as the average price of sold and rented flats. More specific results were obtained by applying data mining methods: regression (linear, lasso and ridge regression) for predicting flat prices and utility payments, and classification (support vector machines, KNN, Gaussian naïve Bayes, decision tree and random forest) for estimating the PENB class (building energy performance certificate) and building condition. Lasso regression performed best in predicting the rent price (R² = 0.76). Among the classification tasks, the best results were achieved with random forest, whose accuracy exceeded 80% in some cases. Other tasks included clustering (k-means and k-modes) and anomaly detection (isolation forest). The main focus was on descriptive data mining, especially clustering. K-means clustering of flats by geographic coordinates (silhouette score of 0.78) shows that the most expensive flats are, on average, in the Bohemian regions, followed by Silesia, while the cheapest are in central Moravia. Another clustering task identified flats in the Moravian-Silesian region with very high utility payments (silhouette score of 0.56). The models can help estimate the value of flats based on their attributes and location.
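The silhouette scores quoted in the abstract measure how well separated the clusters are: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the point's coefficient is (b - a) / max(a, b); the score is the mean over all points. The following is a minimal sketch of that calculation, not the authors' code. The function name, the toy coordinates and the labels are hypothetical, and plain Euclidean distance on (longitude, latitude) pairs is a simplifying assumption.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the sum includes dist(p, p) == 0, hence the division by n - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(p, q) for q in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical toy data: (longitude, latitude) pairs forming two tight groups.
toy_points = [
    (14.4, 50.1), (14.5, 50.0), (14.3, 50.1),
    (17.2, 49.6), (17.3, 49.5), (17.1, 49.6),
]
good_labels = [0, 0, 0, 1, 1, 1]  # labels that match the two groups
bad_labels = [0, 1, 0, 1, 0, 1]   # labels that mix the two groups

good_score = silhouette_score(toy_points, good_labels)
bad_score = silhouette_score(toy_points, bad_labels)
print(f"well-separated labelling: {good_score:.2f}")
print(f"mixed labelling: {bad_score:.2f}")
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 or below indicate overlapping or mislabelled ones. On that scale, the reported 0.78 for the geographic clusters points to a clearer separation than the 0.56 for the utility-payment clusters.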