Sliced inverse regression (SIR) is a popular sufficient dimension reduction method that identifies a few linear transformations of the covariates without losing regression information about the response. In high-dimensional settings, SIR can be combined with sparsity penalties to achieve sufficient dimension reduction and variable selection simultaneously. However, both classical and sparse SIR estimators assume the covariates are exogenous, whereas endogeneity can arise in a variety of situations, such as when variables are omitted or measured with error. In this article, we show that such endogeneity invalidates SIR estimators, leading to inconsistent estimation of the true central subspace. To address this challenge, we propose a two-stage Lasso SIR estimator, which first constructs a sparse high-dimensional instrumental variables model to obtain fitted values of the covariates spanned by the instruments, and then applies SIR augmented with a Lasso penalty to these fitted values. We establish theoretical bounds for the estimation and selection consistency of the true central subspace for the proposed estimators, allowing the number of covariates and instruments to grow exponentially with the sample size. Simulation studies and applications to two real-world datasets in nutrition and genetics illustrate the superior empirical performance of the two-stage Lasso SIR estimator compared with existing methods that disregard endogeneity and/or nonlinearity in the outcome model.
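The two-stage procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a single-index model (one direction in the central subspace), uses a plain coordinate-descent Lasso, and forms the second-stage pseudo-response in the style of the standard Lasso SIR construction. The function names `lasso_cd` and `two_stage_lasso_sir`, the tuning values `alpha1` and `alpha2`, and the number of slices are all hypothetical choices made for the example.

```python
import numpy as np

def soft_threshold(a, t):
    # Soft-thresholding operator used by the Lasso coordinate update.
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=100):
    # Coordinate descent for (1/2n) * ||y - X b||^2 + alpha * ||b||_1 (no intercept).
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = np.asarray(y, dtype=float).copy()  # running residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]            # remove coordinate j from the fit
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, alpha) / col_sq[j]
            r -= X[:, j] * b[j]            # add the updated coordinate back
    return b

def two_stage_lasso_sir(X, Z, y, alpha1=0.05, alpha2=0.02, n_slices=10):
    # Hypothetical sketch of a two-stage Lasso SIR estimator (single direction).
    X = X - X.mean(axis=0)
    Z = Z - Z.mean(axis=0)
    n, p = X.shape

    # Stage 1: sparse instrumental variables model, one Lasso per covariate,
    # giving fitted values of the covariates spanned by the instruments.
    Gamma = np.column_stack([lasso_cd(Z, X[:, j], alpha1) for j in range(p)])
    X_hat = Z @ Gamma

    # Stage 2: SIR on the fitted values. Slice the sorted response and
    # replace each row by the mean of X_hat within its slice.
    order = np.argsort(y)
    M = np.zeros_like(X_hat)
    for idx in np.array_split(order, n_slices):
        M[idx] = X_hat[idx].mean(axis=0)

    # Leading eigenpair of the covariance of the slice means.
    Lam = M.T @ M / n
    evals, evecs = np.linalg.eigh(Lam)     # eigenvalues in ascending order
    lam, eta = evals[-1], evecs[:, -1]

    # Pseudo-response, then a Lasso regression of it on the fitted values.
    y_tilde = M @ eta / lam
    beta = lasso_cd(X_hat, y_tilde, alpha2)
    norm = np.linalg.norm(beta)
    return beta / norm if norm > 0 else beta
```

The returned vector is identified only up to sign and scale, so it is normalised to unit length. In practice the penalty levels would be chosen by cross-validation or a theory-driven rate rather than fixed as here, and multiple directions would require extracting several eigenvectors.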