It is widely recognised that semiparametric efficient estimation can be hard to achieve in practice: estimators that are efficient in theory may require unattainable levels of accuracy in the estimation of complex nuisance functions. As a consequence, estimators deployed on real datasets are often chosen in a somewhat ad hoc fashion, and may suffer from high variance. We study this gap between theory and practice in the context of a broad collection of semiparametric regression models that includes the generalised partially linear model. We advocate using estimators that are robust in the sense that they are $\sqrt{n}$-consistent uniformly over a sufficiently rich class of distributions, characterised by certain conditional expectations being estimable by user-chosen machine learning methods. We show that even asking for locally uniform estimation within such a class narrows down the possible estimators to those parametrised by certain weight functions. Conversely, we show that such estimators do provide the desired uniform consistency, and we introduce a novel random forest-based procedure for estimating the optimal weights. We prove that the resulting estimator recovers a notion of $\textbf{ro}$bust $\textbf{s}$emiparametric $\textbf{e}$fficiency (ROSE) and provides a practical alternative to semiparametric efficient estimators. We demonstrate the effectiveness of our ROSE random forest estimator in a variety of semiparametric settings on simulated and real-world data.