A step towards the integration of machine learning and small area estimation

The use of machine-learning techniques has grown in numerous research areas. These techniques are now also widely used in statistics, including official statistics, both for data collection (e.g. satellite imagery, web scraping and text mining, data cleaning, integration and imputation) and for data analysis. However, their use in survey sampling, including small area estimation, is still very limited. We therefore propose a predictor supported by these algorithms that can be used to predict any population or subpopulation characteristic based on cross-sectional and longitudinal data. Machine learning methods have already been shown to be very powerful in identifying and modelling complex and nonlinear relationships between variables, which means that they have very good properties under strong departures from the classic assumptions. We therefore analyse the performance of our proposal under a different set-up, which in our opinion is of greater importance in real-life surveys: we study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well, even in comparison with methods that are optimal under the model. Moreover, we propose a method of estimating the accuracy of machine learning predictors, which makes it possible to compare their accuracy with that of classic methods, with accuracy measured as in survey sampling practice. Solving this problem is indicated in the literature as one of the key issues in integrating these approaches. The simulation studies are based on a real longitudinal dataset, freely available from the Polish Local Data Bank, and consider the problem of predicting subpopulation characteristics in the last period while "borrowing strength" from other subpopulations and time periods.
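
The idea of a machine-learning-supported predictor that "borrows strength" across subpopulations can be illustrated with a minimal sketch. It is not the predictor proposed in the paper: it assumes a simple plug-in form for a subpopulation total (observed values for sampled units plus model predictions for unsampled units) and uses a k-nearest-neighbour regressor as a stand-in for the learning algorithm. All names (`knn_predict`, `plug_in_domain_total`, the `units` layout) and the toy data are hypothetical, and the accuracy estimation discussed above is not covered.

```python
from math import dist


def knn_predict(train_x, train_y, x, k=3):
    """Average the responses of the k training points nearest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def plug_in_domain_total(units, domain, k=3):
    """Plug-in prediction of a domain total (illustrative assumption).

    units: list of dicts with keys 'domain', 'x' (tuple of floats) and
           'y' (float if the unit is sampled, None otherwise).
    """
    # The model is trained on sampled units from ALL domains, which is
    # how strength is borrowed from other subpopulations in this sketch.
    sampled = [u for u in units if u["y"] is not None]
    train_x = [u["x"] for u in sampled]
    train_y = [u["y"] for u in sampled]

    total = 0.0
    for u in units:
        if u["domain"] != domain:
            continue
        if u["y"] is not None:
            total += u["y"]  # observed value for a sampled unit
        else:
            total += knn_predict(train_x, train_y, u["x"], k)  # predicted value
    return total


# Toy population: two domains, one auxiliary variable, two unsampled units.
units = [
    {"domain": "A", "x": (1.0,), "y": 10.0},
    {"domain": "A", "x": (1.8,), "y": None},
    {"domain": "B", "x": (2.1,), "y": 21.0},
    {"domain": "B", "x": (3.0,), "y": 30.0},
    {"domain": "B", "x": (1.1,), "y": None},
]

total_a = plug_in_domain_total(units, "A", k=2)
print(total_a)
```

In this toy example the unsampled unit in domain A is predicted from its two nearest sampled neighbours, one of which lies in domain B, so the prediction for A uses information from outside the domain. Any other regression algorithm could replace `knn_predict` without changing the plug-in step.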
