Unifying Information-Theoretic and Pair-Counting Clustering Similarity

Comparing clusterings is central to evaluating unsupervised models, yet the many existing similarity measures can produce widely divergent, sometimes contradictory, evaluations. Clustering similarity measures are typically organized into two principal families, pair-counting and information-theoretic, reflecting whether they quantify agreement through element pairs or aggregate information across full cluster contingency tables. Prior work has uncovered parallels between these families and applied empirical normalization or chance-correction schemes, but their deeper analytical connection remains only partially understood. Here, we develop an analytical framework that unifies these families through two complementary perspectives. First, both families are expressed as weighted expansions of observed versus expected co-occurrences, with pair-counting arising as a quadratic, low-order approximation and information-theoretic measures as higher-order, frequency-weighted extensions. Second, we generalize pair-counting to k-tuple agreement and show that information-theoretic measures can be viewed as systematically accumulating higher-order co-assignment structure beyond the pairwise level. We illustrate the approaches analytically for the Rand index and Mutual Information, and show how other indices in each family emerge as natural extensions. Together, these views clarify when and why the two regimes diverge, relating their sensitivities directly to weighting and approximation order, and provide a principled basis for selecting, interpreting, and extending clustering similarity measures across applications.
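To make the two families concrete, the following is a minimal sketch, not the paper's framework, that computes the Rand index (pair-counting) and mutual information (information-theoretic) from the same contingency table. The function names `contingency`, `rand_index`, and `mutual_information` are hypothetical choices for illustration, and the natural logarithm is assumed for mutual information.

```python
from collections import Counter
from math import comb, log


def contingency(a, b):
    """Return joint and marginal cluster counts for two labelings."""
    if len(a) != len(b):
        raise ValueError("labelings must have equal length")
    return Counter(zip(a, b)), Counter(a), Counter(b)


def rand_index(a, b):
    """Fraction of element pairs on which the two clusterings agree."""
    n = len(a)
    if n < 2:
        raise ValueError("need at least two elements to form a pair")
    joint, row, col = contingency(a, b)
    total = comb(n, 2)
    both = sum(comb(c, 2) for c in joint.values())  # together in both
    in_a = sum(comb(c, 2) for c in row.values())    # together in a
    in_b = sum(comb(c, 2) for c in col.values())    # together in b
    # Agreements = pairs together in both + pairs apart in both.
    return (total + 2 * both - in_a - in_b) / total


def mutual_information(a, b):
    """Mutual information (in nats) between the two labelings."""
    n = len(a)
    joint, row, col = contingency(a, b)
    return sum(
        (c / n) * log(c * n / (row[i] * col[j]))
        for (i, j), c in joint.items()
    )


if __name__ == "__main__":
    a = [0, 0, 1, 1]
    b = [0, 1, 0, 1]  # statistically independent of a
    print(rand_index(a, b))          # 1/3: pairs apart in both still count
    print(mutual_information(a, b))  # 0.0: no shared information
```

On this toy input the uncorrected Rand index is nonzero even though the labelings are independent, while mutual information is zero, a small instance of the kind of divergence between the two families that the abstract describes.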
