This paper investigates the multiple testing problem for high-dimensional sparse binary sequences, motivated by the crowdsourcing problem in machine learning. We study the empirical Bayes approach to multiple testing in the high-dimensional Bernoulli model with a conjugate spike and uniform slab prior. We first show that the hard thresholding rule derived from the posterior distribution is suboptimal. Consequently, the $\ell$-value procedure constructed from this posterior tends to be overly conservative in controlling the false discovery rate (FDR). We then propose two new procedures, based on adjusted $\ell$-values and $q$-values, to correct this issue. We obtain sharp frequentist theoretical results showing that both procedures effectively control the FDR under sparsity. Numerical experiments validate our theory in finite samples. To the best of our knowledge, this work provides the first uniform FDR control result in multiple testing for high-dimensional sparse binary data.