A zero-estimator approach for estimating the signal level in a high-dimensional regression setting

Analysis of high-dimensional data, where the number of covariates is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level $\tau^2$ and the noise level $\sigma^2$, i.e., to quantify how much of the variation in the response variable can be explained by the covariates and how much is left unexplained. This thesis considers the estimation of these quantities in a semi-supervised setting, where for many observations only the vector of covariates $X$ is given, with no response $Y$. Our main research question is: how can one use the unlabeled data to better estimate $\tau^2$ and $\sigma^2$? We consider two frameworks: a linear regression model and a linear projection model in which linearity is not assumed. In the first framework, linearity is assumed but no sparsity assumptions are made on the coefficients. In the second framework, the linearity assumption is also relaxed, and we aim to estimate the signal and noise levels defined by the linear projection. We first propose a naive estimator which, under some assumptions, is unbiased and consistent in both frameworks. We then show how the naive estimator can be improved by using zero-estimators, where a zero-estimator is a statistic arising from the unlabeled data whose expected value is zero. In the first framework, we calculate the optimal zero-estimator improvement and discuss ways to approximate it. In the second framework, such optimality no longer holds, and we suggest two zero-estimators that improve the naive estimator, although not necessarily optimally. Furthermore, we show that our approach reduces the variance for general initial estimators, and we present an algorithm that potentially improves any initial estimator. Lastly, we consider four datasets and study the performance of our suggested methods.
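To make the zero-estimator idea concrete, the following is a minimal simulation sketch and not the thesis's actual estimators. It assumes the covariates have known mean zero and identity covariance, so that $\tau^2 = \|\beta\|^2$ and $E[X_{ij} Y_i] = \beta_j$. The function names `naive_tau2`, `zero_estimator`, and `simulate`, the U-statistic form of the naive estimator, and the particular zero-estimator (the average of $\|X_i\|^2 - p$) are all illustrative assumptions. The adjustment coefficient is fitted across simulated replications, which is only possible in a simulation; in practice it would have to be approximated.

```python
import numpy as np

def naive_tau2(X, y):
    """Unbiased estimate of tau^2 = ||beta||^2, ASSUMING E[X]=0 and Cov(X)=I.

    For each coordinate j, W_ij = X_ij * y_i has mean beta_j, so the
    U-statistic sum_{i1 != i2} W_{i1 j} W_{i2 j} / (n (n - 1)) is unbiased
    for beta_j^2. Summing over j gives an unbiased estimate of tau^2.
    """
    n = X.shape[0]
    W = X * y[:, None]
    S = W.sum(axis=0)          # sum_i W_ij
    Q = (W ** 2).sum(axis=0)   # sum_i W_ij^2
    # (sum_i W_ij)^2 - sum_i W_ij^2 equals the sum over pairs i1 != i2.
    return float((S ** 2 - Q).sum() / (n * (n - 1)))

def zero_estimator(X):
    """Hypothetical zero-estimator: mean of ||X_i||^2 minus p.

    Its expectation is zero when Cov(X) = I and E[X] = 0 (assumed known).
    """
    n, p = X.shape
    return float((X ** 2).sum() / n - p)

def simulate(reps=500, n=50, p=100, tau2=1.0, sigma2=1.0, seed=0):
    """Compare the naive estimator with a zero-estimator-adjusted version."""
    rng = np.random.default_rng(seed)
    beta = np.full(p, np.sqrt(tau2 / p))   # dense coefficients, no sparsity
    T = np.empty(reps)
    Z = np.empty(reps)
    for r in range(reps):
        X = rng.standard_normal((n, p))
        y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
        T[r] = naive_tau2(X, y)
        Z[r] = zero_estimator(X)
    # Variance-minimizing coefficient c = Cov(T, Z) / Var(Z), fitted here
    # across replications (a simulation-only shortcut).
    c = np.cov(T, Z)[0, 1] / np.var(Z, ddof=1)
    improved = T - c * Z
    return T, improved, c

if __name__ == "__main__":
    T, improved, c = simulate()
    print("naive:    mean", T.mean(), "variance", T.var(ddof=1))
    print("adjusted: mean", improved.mean(), "variance", improved.var(ddof=1))
```

Because the coefficient is fitted on the same replications, the sample variance of the adjusted estimates cannot exceed that of the naive ones. How large the reduction is depends on how strongly the chosen zero-estimator correlates with the naive estimator.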
