Multiple hypothesis testing with false discovery rate (FDR) control is a fundamental problem in statistical inference, with broad applications in genomics, drug screening, and outlier detection. In many such settings, researchers may have access not only to real experimental observations but also to auxiliary or synthetic data -- from past, related experiments or generated by generative models -- that can provide additional evidence about the hypotheses of interest. We introduce SynthBH, a synthetic-powered multiple testing procedure that safely leverages such synthetic data. We prove that SynthBH guarantees finite-sample, distribution-free FDR control under a mild PRDS-type positive dependence condition, without requiring the pooled-data p-values to be valid under the null. The proposed method adapts to the (unknown) quality of the synthetic data: it enhances the sample efficiency and may boost the power when synthetic data are of high quality, while controlling the FDR at a user-specified level regardless of their quality. We demonstrate the empirical performance of SynthBH on tabular outlier detection benchmarks and on genomic analyses of drug-cancer sensitivity associations, and further study its properties through controlled experiments on simulated data.
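For orientation, the sketch below shows the classical Benjamini-Hochberg (BH) step-up procedure, the standard baseline for FDR control at a user-specified level. It is not SynthBH: the abstract does not specify how synthetic data enter the procedure, and the assumption that SynthBH builds on BH is inferred only from its name. The function name `benjamini_hochberg` and the default `alpha=0.1` are illustrative choices, and the inputs are assumed to be valid p-values computed from real data alone.

```python
def benjamini_hochberg(p_values, alpha=0.1):
    """Classical BH step-up procedure (baseline sketch, NOT SynthBH).

    Returns a list of booleans, one per hypothesis, where True means
    the hypothesis is rejected at target FDR level `alpha`.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank

    # Step-up rule: reject the k hypotheses with the smallest p-values,
    # even if some of them individually missed their own threshold.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5], alpha=0.1))
```

The step-up rule matters in practice: a p-value that exceeds its own rank threshold is still rejected whenever a larger p-value meets its threshold.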