The debiased estimator is a crucial tool for statistical inference on high-dimensional model parameters. Constructing it, however, requires estimating a high-dimensional inverse Hessian matrix, which is computationally expensive. The cost is particularly acute in distributed setups, where traditional methods compute a debiased estimator on every machine and therefore scale poorly as the number of machines grows. In this paper, we study semi-supervised sparse statistical inference in a distributed setup. We develop an efficient multi-round distributed debiased estimator that integrates both labeled and unlabeled data, and we show that the additional unlabeled data improves the statistical rate of each round of iteration. Our approach offers debiasing methods tailored to $M$-estimation and generalized linear models according to the specific form of the loss function, and it also applies to non-smooth losses such as the absolute deviation loss. Furthermore, our algorithm is computationally efficient because it requires only one estimation of a high-dimensional inverse covariance matrix. Simulation studies and real data applications demonstrate the effectiveness of our method and highlight the benefits of incorporating unlabeled data.