Classical inference methods notoriously fail when applied to data-driven test hypotheses or inference targets. Instead, dedicated methodologies are required to obtain statistical guarantees for these selective inference problems. Selective inference is particularly relevant post-clustering, typically when testing a difference in mean between two clusters. In this paper, we address convex clustering with $\ell_1$ penalization by leveraging related selective inference tools for regression, which are based on Gaussian vectors conditioned on polyhedral sets. In the one-dimensional case, we prove a polyhedral characterization of the event of obtaining given clusters, which enables us to propose a test procedure with statistical guarantees. This characterization also allows us to provide a computationally efficient regularization path algorithm. Then, we extend the above test procedure and guarantees to multi-dimensional clustering with $\ell_1$ penalization, and also to more general multi-dimensional clusterings that aggregate one-dimensional ones. With various numerical experiments, we validate our statistical guarantees and demonstrate the power of our methods to detect differences in mean between clusters. Our methods are implemented in the R package poclin.