Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable due to finite sample sizes. A critical challenge in applying permutation tests to genomic studies is that an enormous number of permutations is often needed to obtain reliable estimates of very small $p$-values, which makes the computation intensive. To address this issue, we develop algorithms for the accurate and efficient estimation of small $p$-values in permutation tests for paired and independent two-group genomic data. Our approaches leverage a novel framework that parameterizes the permutation sample spaces of these two data types using the Bernoulli and conditional Bernoulli distributions, respectively, combined with the cross-entropy method. We demonstrate the performance of the proposed algorithms on two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies, and compare them with existing methods such as crude permutations and stochastic approximation Monte Carlo (SAMC). The results show that our approaches can achieve orders-of-magnitude gains in computational efficiency when estimating small $p$-values. Our approaches offer promising solutions for improving the computational efficiency of existing permutation test procedures and for developing new permutation-based testing methods in genomic data analysis.
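To make the paired-data idea concrete, the following is a minimal sketch and not the authors' algorithm. For paired data, a permutation amounts to deciding independently for each pair whether to swap its two labels, so the uniform permutation distribution is a product of Bernoulli(0.5) variables; the cross-entropy method then tilts these probabilities toward the rare event and corrects with importance weights. The function names `exact_p_value` and `ce_p_value`, the choice of the sum of within-pair differences as the test statistic, the one-sided alternative, and the clipping of probabilities to [0.01, 0.99] are all assumptions made for illustration. The independent two-group case, which the abstract handles with the conditional Bernoulli distribution, is not covered here.

```python
import itertools
import math
import random

TOL = 1e-12  # tolerance for floating-point ties in the test statistic


def exact_p_value(diffs):
    """One-sided sign-flip p-value by enumerating all 2^n sign vectors."""
    t_obs = sum(diffs)
    hits = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        if sum(s * d for s, d in zip(signs, diffs)) >= t_obs - TOL:
            hits += 1
    return hits / 2 ** len(diffs)


def ce_p_value(diffs, n_samples=2000, rho=0.1, max_iter=50, seed=0):
    """Importance-sampling estimate with Bernoulli proposals tuned by cross-entropy.

    probs[i] is the proposal probability of keeping the observed sign of
    pair i; 0.5 for every pair recovers the null permutation distribution.
    """
    rng = random.Random(seed)
    n = len(diffs)
    t_obs = sum(diffs)
    probs = [0.5] * n

    def draw(p_vec):
        keep = [rng.random() < p for p in p_vec]
        stat = sum(d if k else -d for d, k in zip(diffs, keep))
        # Likelihood ratio of the null (all 0.5) to the proposal p_vec.
        log_w = sum(math.log(0.5 / p) if k else math.log(0.5 / (1.0 - p))
                    for p, k in zip(p_vec, keep))
        return keep, stat, math.exp(log_w)

    for _ in range(max_iter):
        samples = [draw(probs) for _ in range(n_samples)]
        stats = sorted(s for _, s, _ in samples)
        # Intermediate level: the (1 - rho) sample quantile, capped at t_obs.
        level = min(stats[int((1 - rho) * n_samples)], t_obs)
        elite = [(k, w) for k, s, w in samples if s >= level - TOL]
        total = sum(w for _, w in elite)
        # Weighted cross-entropy update of each Bernoulli parameter.
        probs = [sum(w for k, w in elite if k[i]) / total for i in range(n)]
        # Clipping (an assumption of this sketch) keeps the weights bounded.
        probs = [min(max(p, 0.01), 0.99) for p in probs]
        if level >= t_obs - TOL:
            break

    final = [draw(probs) for _ in range(n_samples)]
    return sum(w for _, s, w in final if s >= t_obs - TOL) / n_samples


if __name__ == "__main__":
    # Made-up within-pair differences, small enough to enumerate exactly.
    example = [1.2, 0.8, 1.5, 0.9, 1.1, -0.2, 1.3, 0.7, 1.0, 0.6, 1.4, -0.1]
    print("exact:", exact_p_value(example))
    print("cross-entropy estimate:", ce_p_value(example))
```

With only 12 pairs the exact enumeration is cheap, which makes it a convenient check on the estimator. The motivation for tilted sampling is that crude permutation sampling needs on the order of 1/p draws to observe even a single exceedance, whereas a well-tuned proposal makes exceedances common and lets the importance weights restore the correct scale.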