Predicting Census Survey Response Rates With Parsimonious Additive Models and Structured Interactions

In this paper we consider the problem of predicting survey response rates using a family of flexible and interpretable nonparametric models. The study is motivated by the US Census Bureau's well-known ROAM application, which uses a linear regression model trained on the US Census Planning Database to identify hard-to-survey areas. A crowdsourcing competition (Erdman and Bates, 2016) organized around ten years ago revealed that machine learning methods based on ensembles of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to their black-box nature. We consider nonparametric additive models with a small number of main and pairwise interaction effects, selected using $\ell_0$-based penalization. From a methodological viewpoint, we study both computational and statistical aspects of our estimator and discuss variants that incorporate strong hierarchical interactions. Our algorithms (open-sourced on GitHub) extend the computational frontiers of existing algorithms for sparse additive models so that they can handle datasets relevant for the application we consider. We discuss and interpret findings from our model on the US Census Planning Database. In addition to being useful from an interpretability standpoint, our models lead to predictions that appear to be better than popular black-box machine learning methods based on gradient boosting and feedforward neural networks, suggesting that it is possible to have models that have the best of both worlds: good model accuracy and interpretability.
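
For orientation, one common way to write an additive model with main and pairwise interaction effects under an $\ell_0$-type penalty is sketched below. The notation ($f_j$, $f_{jk}$, $\lambda$, $\alpha$, $\Omega$) is assumed here for illustration and is not taken from the paper; the estimator actually studied may differ in its smoothness penalty and constraints.

$$
f(x) = \beta_0 + \sum_{j=1}^{p} f_j(x_j) + \sum_{1 \le j < k \le p} f_{jk}(x_j, x_k)
$$

$$
\min_{\beta_0,\,\{f_j\},\,\{f_{jk}\}} \; \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 + \Omega(f) + \lambda \Bigl( \sum_{j=1}^{p} \mathbf{1}[f_j \neq 0] + \alpha \sum_{1 \le j < k \le p} \mathbf{1}[f_{jk} \neq 0] \Bigr)
$$

Here $\Omega(f)$ stands for a generic smoothness penalty on the component functions, $\lambda \ge 0$ controls how many components are selected, and $\alpha > 0$ weights interactions relative to main effects. In this notation, strong hierarchy is the constraint that $f_{jk} \neq 0$ implies both $f_j \neq 0$ and $f_k \neq 0$.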
