If the same data are used both for clustering and for testing a null hypothesis formulated in terms of the estimated clusters, the traditional hypothesis testing framework often fails to control the Type I error. Gao et al. [2022] and Chen and Witten [2023] provide selective inference frameworks for testing whether a pair of estimated clusters reflects a genuine underlying difference, for the cases where the clusters are defined by hierarchical clustering and K-means clustering, respectively. In applications, however, it is often of interest to test multiple pairs of clusters. In our work, we extend the pairwise test of Chen and Witten [2023] to a test for multiple pairs of clusters, where the cluster assignments are produced by K-means clustering. We further develop an analogous test for the setting where the variance is unknown, building on the work of Yun and Barber [2023], which extends the pairwise test of Gao et al. [2022] to the case of unknown variance. For both the known and unknown variance settings, we present methods that address certain forms of data-dependence in the choice of which pairs of clusters to test. We show that our proposed tests control the Type I error, both theoretically and empirically, and provide a numerical study of their empirical power under various settings.
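The failure of the traditional framework described in the opening sentence can be illustrated with a small simulation. The sketch below is not the selective test proposed in this work. It only shows the problem that motivates it, under assumptions chosen for illustration: one-dimensional data drawn from a single N(0, 1) distribution (so there are no true clusters), K = 2, known variance, and a naive two-sample z-test applied to the clusters estimated from the same data. All function names and parameter values are hypothetical.

```python
import math
import random


def two_means_1d(x, n_iter=50):
    """Lloyd's algorithm for K = 2 in one dimension; returns 0/1 labels."""
    # Deterministic initialization at the extremes keeps both clusters non-empty.
    c0, c1 = min(x), max(x)
    labels = []
    for _ in range(n_iter):
        new_labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in x]
        if new_labels == labels:
            break  # assignments have converged
        labels = new_labels
        g0 = [v for v, lab in zip(x, labels) if lab == 0]
        g1 = [v for v, lab in zip(x, labels) if lab == 1]
        c0 = sum(g0) / len(g0)
        c1 = sum(g1) / len(g1)
    return labels


def naive_z_pvalue(x, labels, sigma=1.0):
    """Two-sided z-test p-value that ignores how the clusters were chosen."""
    g0 = [v for v, lab in zip(x, labels) if lab == 0]
    g1 = [v for v, lab in zip(x, labels) if lab == 1]
    diff = sum(g0) / len(g0) - sum(g1) / len(g1)
    se = sigma * math.sqrt(1.0 / len(g0) + 1.0 / len(g1))
    z = abs(diff) / se
    # 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


def naive_rejection_rate(n_sims=500, n=30, alpha=0.05, seed=0):
    """Fraction of null data sets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Global null: every observation comes from the same N(0, 1).
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        labels = two_means_1d(x)
        if naive_z_pvalue(x, labels) < alpha:
            rejections += 1
    return rejections / n_sims


rate = naive_rejection_rate()
print(f"naive rejection rate under the null: {rate:.3f} (nominal level 0.05)")
```

Because the clustering step separates the sample into a low group and a high group by construction, the naive test should reject far more often than the nominal 5% even though no true clusters exist. A selective test conditions on the clustering outcome to correct for this.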