We consider a novel Bayesian approach to estimation, uncertainty quantification, and variable selection for a high-dimensional linear regression model under sparsity. The number of predictors can be nearly exponentially large relative to the sample size. We initially place a conjugate normal prior that disregards sparsity, but for inference, instead of the original multivariate normal posterior, we use the posterior distribution induced by a map that transforms the vector of regression coefficients into a sparse vector obtained by minimizing the sum of squared deviations plus a suitably scaled $\ell_1$-penalty on the vector. We show that the resulting sparse projection-posterior distribution contracts around the true value of the parameter at the optimal rate adapted to the sparsity of the vector, and that the true sparsity structure receives large sparse projection-posterior probability. We further show that an appropriately recentred credible ball has the correct asymptotic frequentist coverage. Finally, we describe how the computational burden can be distributed across many machines, each handling only a small fraction of the whole dataset. We conduct a comprehensive simulation study under a variety of settings and find that the proposed method performs well for finite sample sizes. We also apply the method to several real datasets, including the ADNI data, and compare its performance with state-of-the-art methods. We implemented the method in the \texttt{R} package \texttt{sparseProj}, and all computations have been carried out using this package.
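The sparse projection-posterior idea above can be sketched in a few lines: draw from the dense conjugate normal posterior, then push each draw through an $\ell_1$-penalized least-squares map to induce a sparse posterior. The sketch below is illustrative only, assuming one concrete form of the projection map, $b^*(\theta) = \arg\min_b \|X(\theta - b)\|^2 + \lambda \|b\|_1$, solved by proximal gradient descent (ISTA); the simulated data, prior variances, and penalty scaling are our own assumptions, not specifications from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n samples, p predictors, a 3-sparse truth
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.standard_normal(n)

# Conjugate normal posterior N(mu, Sigma) under beta ~ N(0, tau2 * I),
# disregarding sparsity at this stage (assumed variances)
sigma2, tau2 = 1.0, 10.0
Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)
mu = Sigma @ X.T @ y / sigma2

def l1_project(theta, X, lam, n_iter=500):
    """Sparse projection of a posterior draw theta:
    argmin_b ||X(theta - b)||^2 + lam * ||b||_1, via ISTA."""
    b = theta.copy()
    L = 2 * np.linalg.norm(X, 2) ** 2   # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ (b - theta))
        z = b - grad / L                 # gradient step on the smooth part
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

# Lasso-type penalty scaling (an assumption, up to constants)
lam = 2 * np.sqrt(2 * n * np.log(p))

# Draw from the dense posterior, then map each draw to a sparse vector;
# the induced law of sparse_draws is the sparse projection-posterior
draws = rng.multivariate_normal(mu, Sigma, size=200)
sparse_draws = np.array([l1_project(th, X, lam) for th in draws])

# Posterior selection frequency of each predictor
support_freq = (sparse_draws != 0).mean(axis=0)
```

Selection, estimation, and credible sets are then all read off the mapped draws: for example, `support_freq` estimates the projection-posterior probability that each coefficient is active.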