Performing variable selection and inference simultaneously in high-dimensional models remains an open challenge in statistics and machine learning. As datasets with very large numbers of variables become increasingly available, dedicated statistical procedures are needed to select the most important predictors in a high-dimensional space accurately while controlling some form of selection error. In this work we adapt the Mirror Statistic approach to False Discovery Rate (FDR) control to a Bayesian modelling framework. The Mirror Statistic, developed in the classical frequentist framework, is a flexible method for controlling the FDR that requires only mild model assumptions, but it needs two independent sets of regression coefficient estimates, usually obtained by splitting the original dataset. Here we propose a Bayesian formulation of the model and use the posterior distributions of the coefficients of interest to build the Mirror Statistic and control the FDR without splitting the data. Moreover, the method is flexible: it can be used with continuous and discrete outcomes and with more complex predictor structures, such as those of mixed models. We keep the approach scalable to high dimensions by relying on Automatic Differentiation Variational Inference and fully continuous prior choices.
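To make the Mirror Statistic idea concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It uses the common form M_j = sign(b1_j * b2_j) * (|b1_j| + |b2_j|) with a data-driven cutoff, where null coefficients give statistics roughly symmetric around zero, so the left tail estimates the number of false discoveries in the right tail. The function names (`mirror_statistic`, `fdr_threshold`, `select`), the simulated Gaussian "posterior" with means `post_mean` and standard deviations `post_sd`, and the choice to form the two coefficient sets from two independent posterior draws are all illustrative assumptions; the paper's actual construction from the posterior may differ.

```python
import numpy as np


def mirror_statistic(beta_a, beta_b):
    """M_j = sign(beta_a_j * beta_b_j) * (|beta_a_j| + |beta_b_j|).

    Large positive values suggest a real signal (both estimates agree in
    sign and are far from zero); nulls scatter symmetrically around zero.
    """
    beta_a = np.asarray(beta_a, dtype=float)
    beta_b = np.asarray(beta_b, dtype=float)
    return np.sign(beta_a * beta_b) * (np.abs(beta_a) + np.abs(beta_b))


def fdr_threshold(m, q):
    """Smallest cutoff t whose estimated false discovery proportion is <= q.

    The estimate is #{M_j <= -t} / max(#{M_j >= t}, 1), evaluated only at
    the observed |M_j| values, where the counts can change.
    """
    m = np.asarray(m, dtype=float)
    candidates = np.sort(np.abs(m[m != 0]))
    for t in candidates:
        fdp_hat = np.sum(m <= -t) / max(np.sum(m >= t), 1)
        if fdp_hat <= q:
            return float(t)
    return float("inf")  # no cutoff meets the target: select nothing


def select(m, q):
    """Indices of variables whose mirror statistic reaches the cutoff."""
    m = np.asarray(m, dtype=float)
    return np.flatnonzero(m >= fdr_threshold(m, q))


# --- Illustrative use (simulated numbers, NOT results from the paper) ---
# Assumption: a mean-field Gaussian approximation to the posterior, of the
# kind a variational fit could return, with 10 signals among 200 variables.
rng = np.random.default_rng(0)
p, n_signal = 200, 10
post_mean = np.zeros(p)
post_mean[:n_signal] = 3.0
post_sd = np.full(p, 0.3)

# Assumption: two independent posterior draws stand in for the two
# independent coefficient estimates that data splitting would provide.
draw_a = rng.normal(post_mean, post_sd)
draw_b = rng.normal(post_mean, post_sd)

m_stat = mirror_statistic(draw_a, draw_b)
selected = select(m_stat, q=0.1)
print("selected variables:", selected)
```

The sum |b1_j| + |b2_j| is only one admissible choice for combining the two magnitudes; other symmetric, monotone combinations such as the product or the minimum can be substituted in `mirror_statistic` without changing the thresholding step.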