In this paper we consider the problem of predicting survey response rates using a family of flexible and interpretable nonparametric models. The study is motivated by the US Census Bureau's well-known ROAM application, which uses a linear regression model trained on the US Census Planning Database to identify hard-to-survey areas. A crowdsourcing competition (Erdman and Bates, 2016) organized around ten years ago revealed that machine learning methods based on ensembles of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application because of their black-box nature. We consider nonparametric additive models with a small number of main and pairwise interaction effects, selected using $\ell_0$-based penalization. From a methodological viewpoint, we study both computational and statistical aspects of our estimator and discuss variants that incorporate strong hierarchical interactions. Our algorithms (open-sourced on GitHub) extend the computational frontiers of existing algorithms for sparse additive models so that they can handle datasets of the size relevant to the application we consider. We discuss and interpret findings from our model on the US Census Planning Database. In addition to being useful from an interpretability standpoint, our models lead to predictions that appear to be better than those of popular black-box machine learning methods based on gradient boosting and feedforward neural networks. This suggests that it is possible to have models with the best of both worlds: good model accuracy and interpretability.
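As an illustration only, a sparse additive model with main and pairwise interaction effects and $\ell_0$-based penalization can be written in the following generic form. The notation is assumed here and is not taken from the abstract: $f_j$ and $f_{jk}$ denote main-effect and interaction components, $\Omega$ is a placeholder smoothness penalty, and $\lambda_1, \lambda_2 \ge 0$ are tuning parameters. The paper's exact estimator may differ.

$$
\min_{\{f_j\},\,\{f_{jk}\}} \;\; \frac{1}{n}\sum_{i=1}^{n}\Bigl(y_i - \sum_{j} f_j(x_{ij}) - \sum_{j<k} f_{jk}(x_{ij}, x_{ik})\Bigr)^{2} \;+\; \Omega(f) \;+\; \lambda_1 \sum_{j} \mathbf{1}\bigl[f_j \neq 0\bigr] \;+\; \lambda_2 \sum_{j<k} \mathbf{1}\bigl[f_{jk} \neq 0\bigr]
$$

In this notation, strong hierarchy is the constraint that an interaction may enter the model only if both of its main effects do: $f_{jk} \neq 0 \implies f_j \neq 0 \text{ and } f_k \neq 0$.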