Optimal Nuisance Function Tuning for Estimating a Doubly Robust Functional under Proportional Asymptotics

In this paper, we study the asymptotically optimal choice of ridge regression tuning parameters for estimating the nuisance functions of a statistical functional that has recently gained prominence in conditional independence testing and causal inference. Given a sample of size $n$, we study estimators of the Expected Conditional Covariance (ECC) between variables $Y$ and $A$ given a high-dimensional covariate $X \in \mathbb{R}^p$. Under linear regression models for $Y$ and $A$ on $X$ and the proportional asymptotic regime $p/n \to c \in (0, \infty)$, we evaluate three existing ECC estimators and two sample splitting strategies for estimating the required nuisance functions. Since no consistent estimator of the nuisance functions exists in the proportional asymptotic regime without imposing further structure on the problem, we first derive debiased versions of the ECC estimators that use ridge regression estimators of the nuisance functions. We show that our bias correction strategy yields $\sqrt{n}$-consistent estimators of the ECC across the sample splitting strategies and estimator choices considered. We then derive the asymptotic variances of these debiased estimators to illustrate how the sample splitting strategy, the estimator choice, and the tuning parameters of the nuisance function estimators jointly determine how well the ECC can be estimated. Our analysis reveals that prediction-optimal tuning parameters (i.e., those that optimally estimate the nuisance functions) may not lead to the lowest asymptotic variance of the ECC estimator, so tuning parameters should be selected with the final goal of inference in mind. Finally, we verify our theoretical results through extensive numerical experiments.
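To make the setup in the abstract concrete, the following is a minimal sketch of an uncorrected, sample-split estimate of the ECC, written as $E[(Y - E[Y \mid X])(A - E[A \mid X])]$, with ridge regression for both nuisance functions. It is an illustration only and not the paper's debiased estimator: the function names `ridge_fit` and `ecc_residual_product`, the penalty scaling $X^\top X / n + \lambda I$, the single half/half split, and the simulation settings are all assumptions made for this example. In the proportional regime such an uncorrected estimate is expected to be biased, which is the motivation for the bias correction described above.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients solving (X'X/n + lam*I) b = X'y/n.

    The penalty scaling is an assumption of this sketch, not taken from the paper.
    """
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)


def ecc_residual_product(X, Y, A, lam_y, lam_a, seed=0):
    """Uncorrected sample-split ECC estimate (hypothetical helper).

    Ridge fits for Y on X and A on X use one half of the data; the
    product of out-of-sample residuals is averaged over the other half.
    No bias correction is applied.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    train, test = perm[: n // 2], perm[n // 2:]

    beta_hat = ridge_fit(X[train], Y[train], lam_y)   # nuisance for Y
    alpha_hat = ridge_fit(X[train], A[train], lam_a)  # nuisance for A

    res_y = Y[test] - X[test] @ beta_hat
    res_a = A[test] - X[test] @ alpha_hat
    return float(np.mean(res_y * res_a))


if __name__ == "__main__":
    # Illustrative simulation with p/n = 0.5; all settings are assumptions.
    rng = np.random.default_rng(42)
    n, p, rho = 400, 200, 0.5
    X = rng.standard_normal((n, p))
    beta = rng.standard_normal(p) / np.sqrt(p)
    alpha = rng.standard_normal(p) / np.sqrt(p)
    # Errors with Cov(eps_Y, eps_A) = rho, so the true ECC equals rho.
    eps = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    Y = X @ beta + eps[:, 0]
    A = X @ alpha + eps[:, 1]

    for lam in (0.01, 0.1, 1.0):
        est = ecc_residual_product(X, Y, A, lam, lam)
        print(f"lambda={lam}: uncorrected ECC estimate = {est:.3f} (true ECC = {rho})")
```

Running the script for several values of `lam` gives a rough feel for how the tuning parameter moves the uncorrected estimate when $p$ is comparable to $n$. The tuning parameters that minimize prediction error for `beta_hat` and `alpha_hat` need not be the ones that are best for the ECC itself, which is the point the abstract makes for the debiased estimators.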
