Motivated by applications in text mining and discrete distribution inference, we study the problem of testing whether the probability mass functions of $K$ groups of high-dimensional multinomial distributions are equal. We propose a test statistic and show that it is asymptotically standard normal under the null hypothesis. We establish the optimal detection boundary and show that the proposed test attains it across the entire parameter space of interest. The method is evaluated in simulation studies and applied to two real-world datasets, examining variation among consumer reviews of Amazon movies and the diversity of statistical paper abstracts.