The Gaussian Process (GP) is a highly flexible non-linear regression method that provides a principled way of handling uncertainty over predicted (counterfactual) values. It does so by computing a posterior distribution over predicted points as a function of a chosen model space and the observed data, in contrast to conventional approaches that effectively compute uncertainty estimates conditional on placing full faith in a fitted model. This is especially valuable under conditions of extrapolation or weak overlap, where model dependency poses a severe threat. We first offer an accessible explanation of GPs and provide an implementation suited to social science inference problems. In doing so, we reduce the number of user-chosen hyperparameters from three to zero. We then illustrate the settings in which GPs can be most valuable: those where conventional approaches have poor properties because of model dependency and extrapolation in data-sparse regions. Specifically, we apply GPs to (i) comparisons in which treated and control groups have poor covariate overlap; (ii) interrupted time-series designs, where models are fitted prior to an event but extrapolated after it; and (iii) regression discontinuity designs, which depend on model estimates taken at or just beyond the edge of their supporting data.
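To make the idea of a posterior distribution over predicted points concrete, the following is a minimal sketch of textbook GP regression in NumPy. It is not the implementation described in the abstract: the squared-exponential kernel, the function names `rbf_kernel` and `gp_posterior`, the simulated data, and the fixed values of `lengthscale`, `signal_var`, and `noise_var` are all assumptions chosen for illustration, and the sketch does not attempt the hyperparameter-free procedure the abstract mentions.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior(x_train, y_train, x_test,
                 lengthscale=1.0, signal_var=1.0, noise_var=0.1):
    """Posterior mean and pointwise variance of the latent function at x_test.

    Hyperparameters are fixed by assumption here; they are not estimated.
    """
    K = rbf_kernel(x_train, x_train, lengthscale, signal_var)
    K = K + noise_var * np.eye(len(x_train))        # add observation noise
    K_s = rbf_kernel(x_train, x_test, lengthscale, signal_var)
    K_ss = rbf_kernel(x_test, x_test, lengthscale, signal_var)

    # Solve via a Cholesky factor rather than an explicit matrix inverse.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)

    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v
    return mean, np.diag(cov)


# Simulated data observed only on [-2, 2] (illustrative, not from the paper).
rng = np.random.default_rng(0)
x_train = np.linspace(-2.0, 2.0, 25)
y_train = np.sin(x_train) + rng.normal(scale=0.1, size=x_train.size)

# One test point inside the data and one far outside it (extrapolation).
x_test = np.array([0.0, 5.0])
mean, var = gp_posterior(x_train, y_train, x_test)
sd = np.sqrt(var)
print("posterior sd inside the data:", sd[0])
print("posterior sd when extrapolating:", sd[1])
```

In this sketch the posterior standard deviation at the extrapolated point is much larger than at the interior point, reverting toward the prior as the test point moves away from the data. That behaviour is what the abstract relies on in settings with weak overlap or extrapolation.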