Binned scatter plots are a powerful statistical tool for empirical work in the social, behavioral, and biomedical sciences. Available methods rely on a quantile-based partitioning estimator of the conditional mean regression function, primarily to construct flexible yet interpretable visualizations, but they can also be used to estimate treatment effects, assess uncertainty, and test substantive domain-specific hypotheses. This paper introduces novel binscatter methods based on nonlinear, possibly nonsmooth M-estimation, covering generalized linear, robust, and quantile regression models. We provide a host of theoretical results and practical tools for local constant estimation along with piecewise polynomial and spline approximations, including (i) optimal tuning parameter (number of bins) selection, (ii) confidence bands, and (iii) formal statistical tests regarding functional form or shape restrictions. Our main results rely on novel strong approximations for general partitioning-based estimators covering random, data-driven partitions, which may be of independent interest. We demonstrate our methods with an empirical application studying the relation between the percentage of individuals without health insurance and per capita income at the zip-code level. We provide general-purpose software packages implementing our methods in Python, R, and Stata.
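To make the quantile-based partitioning idea concrete, the following is a minimal sketch of a local constant binscatter. It is not the interface of the software packages mentioned above. The function name `binscatter` and its parameters `n_bins` and `stat` are hypothetical and chosen only for illustration. The sketch sorts the data by the covariate, cuts it into equal-count groups so that bin boundaries fall at empirical quantiles, and summarizes the outcome within each bin. As a simplification, tied covariate values may be split across adjacent bins.

```python
from statistics import mean, median


def binscatter(x, y, n_bins=10, stat=mean):
    """Return one (mean of x, stat of y) point per quantile-spaced bin.

    Hypothetical helper for illustration only; not a published API.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if not 1 <= n_bins <= len(x):
        raise ValueError("n_bins must be between 1 and len(x)")

    # Sorting by x and cutting into equal-count groups places the bin
    # boundaries at empirical quantiles of x (a data-driven partition).
    pairs = sorted(zip(x, y), key=lambda p: p[0])
    n = len(pairs)

    points = []
    for j in range(n_bins):
        lo, hi = j * n // n_bins, (j + 1) * n // n_bins
        chunk = pairs[lo:hi]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        # Local constant fit: a single summary of y within the bin.
        points.append((mean(xs), stat(ys)))
    return points


if __name__ == "__main__":
    x = list(range(100))
    y = [2 * v for v in x]
    for cx, fy in binscatter(x, y, n_bins=4):
        print(cx, fy)
```

With the default `stat=mean`, each point is the within-bin sample mean, which corresponds to the conditional mean case. Passing `stat=median` instead gives the within-bin sample median, which is the local constant minimizer of the absolute-error loss and therefore a simple instance of the quantile regression case. Piecewise polynomial or spline fits, bin-number selection, and confidence bands are outside the scope of this sketch.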