Adaptive debiased machine learning using data-driven model selection techniques

Debiased machine learning estimators for smooth functionals in nonparametric models can exhibit substantial variability and instability, often leading practitioners to instead rely on parametric or semiparametric working models. Such models, however, may be misspecified and can therefore introduce bias. We study how data-driven model selection can be combined with debiased machine learning to construct estimators that adapt to structure in the data-generating distribution. To this end, we propose Adaptive Debiased Machine Learning (ADML), a nonparametric framework for constructing superefficient estimators of pathwise differentiable parameters. The framework unifies a broad class of previously proposed adaptive estimators, including methods based on variable selection, learned feature representations, and collaborative targeted learning. It requires only high-level conditions and approximate validity of the selection procedure, which are implied by lower-level conditions already assumed in important settings, including sieve-based selection, sparsity-based methods such as the Lasso, and data-adaptive feature representations. We show that ADML estimators yield regular and efficient root-\(n\) inference for an oracle projection parameter induced by a data-adaptive oracle submodel. This oracle parameter coincides with the target parameter at the true distribution but typically has a smaller efficiency bound, thereby yielding superefficiency for the target parameter. As a practical illustration, we introduce a broad class of automatic ADML estimators for continuous linear functionals of the outcome regression, in which model selection is performed directly on the regression itself. Motivated by overlap challenges in causal inference, we develop new superefficient plug-in estimators for the average treatment effect based on calibration in semiparametric regression models.
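To make the idea of adapting to structure concrete, the following is a toy sketch, not the authors' procedure. It assumes a discrete covariate, so the nonparametric outcome regression is a table of cell means, and it compares that saturated model with an additive working submodel in which the treatment effect is constant. The selection rule (BIC), the simulated data, and all function names are assumptions introduced for illustration. The sketch omits machine-learned nuisance estimation, explicit debiasing, and inference.

```python
import math
import random
from collections import defaultdict


def simulate(n, seed=0):
    """Toy data (hypothetical): x in {0,1,2,3}, binary treatment a,
    outcome y with a constant treatment effect of 1.0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.randrange(4)
        p = 0.2 + 0.2 * x  # treatment probability depends on x
        a = 1 if rng.random() < p else 0
        y = 0.5 * x + 1.0 * a + rng.gauss(0.0, 1.0)
        data.append((x, a, y))
    return data


def mean(values):
    return sum(values) / len(values)


def saturated_ate(data):
    """Plug-in ATE from cell means mu(a, x).
    Assumes every (x, a) cell is non-empty.
    Returns (estimate, residual sum of squares, number of parameters)."""
    cells = defaultdict(list)
    for x, a, y in data:
        cells[(x, a)].append(y)
    mu = {key: mean(ys) for key, ys in cells.items()}
    ate = mean([mu[(x, 1)] - mu[(x, 0)] for x, _, _ in data])
    rss = sum((y - mu[(x, a)]) ** 2 for x, a, y in data)
    return ate, rss, len(mu)


def additive_ate(data):
    """Working submodel mu(a, x) = alpha_x + tau * a.
    tau is the least-squares coefficient after centering a and y within
    each level of x; under this submodel the ATE equals tau."""
    by_x = defaultdict(list)
    for x, a, y in data:
        by_x[x].append((a, y))
    abar = {x: mean([a for a, _ in rows]) for x, rows in by_x.items()}
    ybar = {x: mean([y for _, y in rows]) for x, rows in by_x.items()}
    num = sum((a - abar[x]) * (y - ybar[x]) for x, a, y in data)
    den = sum((a - abar[x]) ** 2 for x, a, y in data)
    tau = num / den
    rss = sum((y - ybar[x] - tau * (a - abar[x])) ** 2 for x, a, y in data)
    return tau, rss, len(by_x) + 1


def bic(rss, n_params, n):
    """Gaussian BIC up to an additive constant."""
    return n * math.log(rss / n) + n_params * math.log(n)


def adaptive_ate(data):
    """Select the working model by BIC, then report that model's ATE estimate."""
    n = len(data)
    fits = {"additive": additive_ate(data), "saturated": saturated_ate(data)}
    chosen = min(fits, key=lambda name: bic(fits[name][1], fits[name][2], n))
    return chosen, fits[chosen][0]


if __name__ == "__main__":
    model, estimate = adaptive_ate(simulate(2000))
    print(model, round(estimate, 3))
```

When the simulated treatment effect really is constant, the selection step will usually pick the smaller additive model, and the estimate then comes from that model rather than from the saturated one. That mirrors, in a very simplified form, how an estimator can gain efficiency by adapting to structure that the data support. If the effect varied with x, the same rule would tend to fall back to the saturated model as the sample grows.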
