As datasets grow larger, they are often distributed across multiple machines that compute in parallel and communicate with a central machine through short messages. In this paper, we focus on sparse regression and propose a new procedure for conducting selective inference with distributed data. Although many distributed procedures exist for point estimation in the sparse setting, few options are available for estimating uncertainties or conducting hypothesis tests based on the estimated sparsity. In our procedure, each machine solves a generalized linear regression and communicates a selected set of predictors to the central machine. The central machine uses these selected predictors to form a generalized linear model (GLM). To conduct inference in the selected GLM, our procedure bases approximately valid selective inference on an asymptotic likelihood. It requests only aggregated, relatively low-dimensional information from each machine, which is then merged at the central machine for selective inference. By reusing low-dimensional summary statistics from the local machines, our procedure achieves higher power while keeping the communication cost low. The method also offers a solution to the notorious p-value lottery problem that arises when model selection is repeated on random splits of data.
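The communication pattern described above can be illustrated with a minimal sketch. It is not the proposed selective inference procedure: it assumes a Gaussian linear model as the simplest GLM, uses the lasso as a stand-in selection rule on each machine, takes the union of the selected sets at the central machine, and then refits by least squares from summed low-dimensional summaries. The refit shown here is naive and does not adjust for selection, which is exactly the gap the selective inference step is meant to close. All function names, the penalty level, and the simulated data are hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for cc in range(c, k + 1):
                M[r][cc] -= f * M[c][cc]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        tail = sum(M[r][cc] * x[cc] for cc in range(r + 1, k))
        x[r] = (M[r][k] - tail) / M[r][r]
    return x


def make_machine_data(rng, n, beta):
    """Simulate one machine's local data (hypothetical Gaussian design)."""
    X = [[rng.gauss(0, 1) for _ in beta] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0, 1) for row in X]
    return X, y


def local_summary(X, y, E):
    """Low-dimensional message: Gram matrix and cross-moments restricted to E."""
    gram = [[sum(row[a] * row[b] for row in X) for b in E] for a in E]
    cross = [sum(row[a] * yi for row, yi in zip(X, y)) for a in E]
    return gram, cross


rng = random.Random(0)
true_beta = [3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
machines = [make_machine_data(rng, 100, true_beta) for _ in range(3)]

# Step 1: each machine selects predictors locally and sends only their indices.
selected_sets = [
    {j for j, b in enumerate(lasso_cd(X, y, lam=0.3)) if b != 0.0}
    for X, y in machines
]

# Step 2: the central machine forms the selected model from the union.
E = sorted(set().union(*selected_sets))

# Step 3: each machine sends |E| x |E| and |E|-dimensional summaries; the
# central machine sums them and refits (naive, not adjusted for selection).
summaries = [local_summary(X, y, E) for X, y in machines]
k = len(E)
G = [[sum(s[0][a][b] for s in summaries) for b in range(k)] for a in range(k)]
c = [sum(s[1][a] for s in summaries) for a in range(k)]
beta_hat = dict(zip(E, solve(G, c)))

print("selected predictors:", E)
print("naive refit:", beta_hat)
```

Each message in this sketch has size at most |E| x |E|, independent of the local sample size, which is what keeps the communication cost low. Confidence intervals built from `beta_hat` in the usual way would ignore that E was chosen using the same data.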