A zero-estimator approach for estimating the signal level in a high-dimensional regression setting

Analysis of high-dimensional data, where the number of covariates is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level $\tau^2$ and the noise level $\sigma^2$, i.e., to quantify how much of the variation in the response variable can be explained by the covariates and how much is left unexplained. This thesis considers the estimation of these quantities in a semi-supervised setting, where for many observations only the vector of covariates $X$ is observed, without the response $Y$. Our main research question is: how can one use the unlabeled data to better estimate $\tau^2$ and $\sigma^2$? We consider two frameworks: a linear regression model and a linear projection model in which linearity is not assumed. In the first framework, linearity is assumed but no sparsity assumptions are made on the coefficients. In the second framework, the linearity assumption is also relaxed and we aim to estimate the signal and noise levels defined by the linear projection. We first propose a naive estimator which is unbiased and consistent, under some assumptions, in both frameworks. We then show how the naive estimator can be improved by using zero-estimators, where a zero-estimator is a statistic computed from the unlabeled data whose expected value is zero. In the first framework, we calculate the optimal zero-estimator improvement and discuss ways to approximate it. In the second framework, such optimality no longer holds, and we suggest two zero-estimators that improve the naive estimator, although not necessarily optimally. Furthermore, we show that our approach reduces the variance for general initial estimators, and we present an algorithm that potentially improves any initial estimator. Lastly, we consider four datasets and study the performance of our suggested methods.
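
To illustrate why subtracting a zero-estimator can reduce variance, the following is a minimal sketch for a single zero-estimator. The symbols $T$ (an unbiased initial estimator of $\tau^2$), $Z$ (a zero-estimator with $0 < \mathrm{Var}(Z) < \infty$), and the scalar $c$ are illustrative notation assumed here, not taken from the text above. For any fixed $c$, the estimator $T - cZ$ remains unbiased because $E[Z] = 0$, and its variance is a quadratic in $c$:

$$
\mathrm{Var}(T - cZ) = \mathrm{Var}(T) - 2c\,\mathrm{Cov}(T, Z) + c^2\,\mathrm{Var}(Z),
\qquad
c^{*} = \frac{\mathrm{Cov}(T, Z)}{\mathrm{Var}(Z)},
\qquad
\mathrm{Var}(T - c^{*}Z) = \mathrm{Var}(T) - \frac{\mathrm{Cov}(T, Z)^2}{\mathrm{Var}(Z)} \le \mathrm{Var}(T).
$$

Under these assumptions, the variance strictly decreases whenever $T$ and $Z$ are correlated. In practice $c^{*}$ depends on unknown moments and would itself have to be estimated or approximated.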
