$\textbf{Motivation:}$ Large-scale genomic studies often require accurate estimates of small $p$-values, both to adjust for multiple hypothesis testing and to rank genomic features by statistical significance. For complicated test statistics whose cumulative distribution functions are analytically intractable, existing methods usually perform poorly on small $p$-values because of limited accuracy or computational restrictions. We propose a general approach, based on the principle of the cross-entropy method and Markov chain Monte Carlo sampling techniques, for accurately and efficiently estimating small $p$-values for a broad range of complicated test statistics. $\textbf{Results:}$ We evaluate the performance of the proposed algorithm through simulations and demonstrate its application to three real-world examples in genomic studies. The results show that our approach can accurately evaluate small to extremely small $p$-values (e.g. $10^{-6}$ to $10^{-100}$). The proposed algorithm can help improve some existing test procedures and support the development of new ones in genomic studies.