Donoho and Kipnis (2022) showed that, in the detection of sparse differences between two large frequency tables when the counts are low, the higher criticism (HC) test statistic has a non-Gaussian phase transition, but they remarked that it is probably not optimal. This setting can be viewed as heterogeneous: cells with larger total counts are able to detect smaller differences. We provide a general study of sparse detection in such heterogeneous settings and show that optimality of the HC test statistic requires thresholding. In frequency table comparison, for example, this means restricting to the p-values of cells whose total counts exceed a threshold. Thresholding also makes the HC test statistic optimal when it is applied to the sparse Poisson means model of Arias-Castro and Wang (2015). The phase transitions we consider are non-Gaussian and involve an interplay between the rate functions of the response and sample size distributions. We also show, both theoretically and in a numerical study, that applying thresholding to the Bonferroni test statistic results in better sparse mixture detection in heterogeneous settings.
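The thresholding idea for frequency table comparison can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an exact two-sided binomial p-value per cell, takes the HC maximum over all sorted p-values rather than a restricted range, and uses a hard cutoff `min_total` on the cell's total count. The function names `binom_two_sided_pvalue`, `higher_criticism`, and `thresholded_hc` are hypothetical, chosen only for this illustration.

```python
import math


def binom_two_sided_pvalue(x, n, q=0.5):
    """Exact two-sided binomial p-value (assumed choice of per-cell test):
    total probability of outcomes no more likely than the observed x."""
    pmf = [math.comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(n + 1)]
    cutoff = pmf[x] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(p for p in pmf if p <= cutoff))


def higher_criticism(pvalues):
    """HC statistic: largest standardized gap between the empirical CDF of
    the p-values and the uniform CDF, over all sorted p-values."""
    n = len(pvalues)
    best = float("-inf")  # returned unchanged if no p-value is usable
    for i, p in enumerate(sorted(pvalues), start=1):
        if 0.0 < p < 1.0:  # skip p-values where the denominator vanishes
            z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1 - p))
            best = max(best, z)
    return best


def thresholded_hc(counts_a, counts_b, min_total):
    """HC restricted to cells whose total count exceeds min_total."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    q = total_a / (total_a + total_b)  # null share of table A in each cell
    pvals = [
        binom_two_sided_pvalue(a, a + b, q)
        for a, b in zip(counts_a, counts_b)
        if a + b > min_total  # thresholding step: drop low-count cells
    ]
    return higher_criticism(pvals)
```

Calling `thresholded_hc(counts_a, counts_b, min_total=0)` keeps every non-empty cell, while a larger `min_total` discards cells whose counts are too low. How the threshold should be chosen for optimality is the subject of the paper and is not addressed by this sketch.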