We study confidence interval construction for linear regression under Huber's contamination model, where an unknown fraction of noise variables is arbitrarily corrupted. While robust point estimation in this setting is well understood, statistical inference remains challenging, especially because the contamination proportion is not identifiable from the data. We develop a new algorithm that constructs confidence intervals for individual regression coefficients without any prior knowledge of the contamination level. Our method is based on a Z-estimation framework using a smooth estimating function. The method directly quantifies the uncertainty of the estimating equation after a preprocessing step that decorrelates covariates associated with the nuisance parameters. We show that the resulting confidence interval has valid coverage uniformly over all contamination distributions and attains an optimal length of order $O(1/\sqrt{n(1-\epsilon)^2})$, matching the rate achievable when the contamination proportion $\epsilon$ is known. This result stands in sharp contrast to the adaptation cost of robust interval estimation observed in the simpler Gaussian location model.
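For concreteness, the setting and the stated rate can be written out as follows. This is only an illustrative sketch: the Gaussian baseline noise law, the noise scale $\sigma$, and the symbols $x_i$, $\beta$, $z_i$, and $Q$ are assumptions introduced here, since the abstract does not fix the notation or the uncontaminated noise distribution.

$$
y_i = x_i^\top \beta + z_i, \qquad z_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\, N(0, \sigma^2) + \epsilon\, Q, \qquad i = 1, \dots, n,
$$

where $Q$ is an arbitrary, unknown contamination distribution and $\epsilon \in [0, 1)$ is the unknown contamination proportion. Because $1 - \epsilon > 0$, the length rate simplifies to

$$
\frac{1}{\sqrt{n(1-\epsilon)^2}} = \frac{1}{(1-\epsilon)\sqrt{n}},
$$

so the interval length behaves as if only the roughly $(1-\epsilon)n$ uncontaminated observations carried information, with an extra factor of $1/\sqrt{1-\epsilon}$.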