Near-optimal multiple testing in Bayesian linear models with finite-sample FDR control

In high-dimensional variable selection problems, statisticians often seek to design multiple testing procedures that control the False Discovery Rate (FDR) while identifying as many relevant variables as possible. Model-X methods, such as knockoffs and conditional randomization tests, achieve the primary goal of finite-sample FDR control, assuming a known distribution of covariates. However, whether these methods can also achieve the secondary goal of maximizing discoveries remains uncertain. In fact, designing procedures that discover more relevant variables under finite-sample FDR control is a largely open question, even within the arguably simplest linear models. In this paper, we develop near-optimal multiple testing procedures for high-dimensional Bayesian linear models with isotropic covariates. We introduce Model-X procedures that provably control the frequentist FDR in finite samples, even when the model is misspecified, and conjecturally achieve near-optimal power when the data follow the Bayesian linear model. Our proposed procedure, PoEdCe, incorporates three key ingredients: Posterior Expectation, distilled Conditional randomization test (dCRT), and the Benjamini-Hochberg procedure with e-values (eBH). The optimality conjecture for PoEdCe is based on a heuristic calculation of its asymptotic true positive proportion (TPP) and false discovery proportion (FDP), which is supported by methods from statistical physics as well as extensive numerical simulations. Our result establishes the Bayesian linear model as a benchmark for comparing the power of various multiple testing procedures.
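
To make the last ingredient concrete, the following is a minimal sketch of the generic e-BH step together with the FDP and TPP of a rejection set. It is not the PoEdCe procedure itself: the construction of e-values from posterior expectations and the dCRT is not shown, and the function names `ebh` and `fdp_tpp`, as well as the e-values in the usage note, are hypothetical choices made for illustration. Given e-values e_1, ..., e_n and a target level alpha, e-BH rejects the k* hypotheses with the largest e-values, where k* is the largest k such that the k-th largest e-value is at least n / (alpha * k).

```python
def ebh(e_values, alpha):
    """Generic e-BH: return the sorted indices rejected at level alpha."""
    n = len(e_values)
    # Indices ordered from the largest to the smallest e-value.
    order = sorted(range(n), key=lambda i: e_values[i], reverse=True)
    k_star = 0
    for k, idx in enumerate(order, start=1):
        # Keep the largest k whose k-th largest e-value clears n / (alpha * k).
        if e_values[idx] >= n / (alpha * k):
            k_star = k
    return sorted(order[:k_star])


def fdp_tpp(rejected, true_support):
    """False discovery proportion and true positive proportion of a rejection set."""
    rejected, true_support = set(rejected), set(true_support)
    false_discoveries = len(rejected - true_support)
    true_discoveries = len(rejected & true_support)
    # Denominators are floored at 1 so that empty sets give 0 rather than an error.
    fdp = false_discoveries / max(len(rejected), 1)
    tpp = true_discoveries / max(len(true_support), 1)
    return fdp, tpp
```

As a usage example with made-up numbers, `ebh([50, 1, 30, 0.5, 12], alpha=0.2)` compares the sorted e-values 50, 30, 12, 1, 0.5 against the thresholds 25, 12.5, 8.33, 6.25, 5 and rejects the hypotheses at indices 0, 2 and 4. If the truly relevant variables were those at indices 0, 2 and 3, `fdp_tpp` would report an FDP of 1/3 and a TPP of 2/3.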
