Modern approaches to building interpretable models of the property market using machine learning on the base of mass cadastral valuation

From arXiv: 62 pages, 21 figures, 11 tables. Accepted in the journal Land Use Policy after a major revision. Changes: literature review added to the introduction, new conclusion, comparison of the models with random forest added, feature selection section reconsidered, many minor corrections, language sufficiently improved.

In this paper, we review modern approaches to building interpretable models of property markets using machine learning, based on the mass valuation of property in the Primorye region, Russia. Many difficulties can arise in the effort to build a good model. Their main source is the huge difference between noisy real market data and the idealized data usually used in machine learning tutorials. This paper covers all stages of modeling: collection of initial data, identification of outliers, search for and analysis of patterns in the data, formation and final choice of price factors, building of the model, and evaluation of its efficiency. For each stage, we highlight potential issues and describe sound methods for overcoming them, using actual examples. We show that combining classical linear regression with kriging (an interpolation method from geostatistics) makes it possible to build an effective model for land parcels. For flats, where many objects are attributed to a single spatial point, applying geostatistical methods becomes problematic. Instead, we suggest linear regression with automatic generation and selection of additional rules on the basis of decision trees, the so-called RuleFit method. We compare the performance of our inherently interpretable models with the well-proven "black-box" Random Forest method and demonstrate similar results. Thus we show that, despite such a strong restriction as the requirement of interpretability, which is important in practice (for example, in legal matters), it is still possible to build effective models of real property markets.
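To make the idea of combining linear regression with kriging more concrete, the following is a minimal sketch, not the paper's implementation. It assumes the two are combined by kriging the regression residuals, which the abstract does not state. The parcel data, the single price factor (`area`), the exponential variogram and its parameters (`sill`, `rng`), and all function names are hypothetical and chosen only for illustration.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single price factor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def variogram(h, sill=1.0, rng=2.0):
    """Exponential semivariogram without nugget (sill and rng are assumed values)."""
    return sill * (1.0 - math.exp(-h / rng))

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def kriging_weights(points, target):
    """Ordinary kriging weights of the known `points` for the location `target`."""
    n = len(points)
    # Variogram matrix bordered by ones (the unbiasedness constraint).
    a = [[variogram(math.dist(p, q)) for q in points] + [1.0] for p in points]
    a.append([1.0] * n + [0.0])
    b = [variogram(math.dist(p, target)) for p in points] + [1.0]
    return solve(a, b)[:n]  # drop the Lagrange multiplier

# Hypothetical toy parcels: location, area, and observed unit price.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
area = [5.0, 7.0, 6.0, 9.0, 8.0]
price = [10.5, 14.2, 12.9, 18.4, 17.0]

# Step 1: linear regression on the price factor.
a0, b0 = fit_line(area, price)
residuals = [p - (a0 + b0 * s) for p, s in zip(price, area)]

def predict(xy, area_value):
    """Step 2: regression trend plus the kriged residual at location xy."""
    w = kriging_weights(coords, xy)
    return a0 + b0 * area_value + sum(wi * ri for wi, ri in zip(w, residuals))

estimate = predict((0.5, 0.5), 7.5)
```

In this sketch the regression part stays directly interpretable (one coefficient per price factor), while kriging only corrects for spatially correlated residuals. If two parcels shared exactly the same coordinates, two rows of the kriging system would coincide and the system would become singular, which is one way to see why the approach becomes problematic for flats attributed to a single spatial point.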
