Comparison-based learning addresses the problem of learning when, instead of explicit features or pairwise similarities, one has access only to comparisons of the form: \emph{Object $A$ is more similar to $B$ than to $C$.} Recently, it has been shown that, in hierarchical clustering, single and complete linkage can be implemented directly using only such comparisons, while several algorithms have been proposed to emulate the behaviour of average linkage. Hence, finding hierarchies (or dendrograms) using only comparisons is a well-understood problem. However, evaluating their meaningfulness when neither ground truth nor explicit similarities are available remains an open question. In this paper, we bridge this gap by proposing a new revenue function that measures the goodness of dendrograms using only comparisons. We show that this function is closely related to Dasgupta's cost for hierarchical clustering, which uses pairwise similarities. On the theoretical side, we use the proposed revenue function to resolve the open problem of whether one can approximately recover a latent hierarchy using few triplet comparisons. On the practical side, we present principled algorithms for comparison-based hierarchical clustering based on the maximisation of the revenue, and we empirically compare them with existing methods.