Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach -- K-means -- produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.
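To make the notion of a "more Zipf-like" lexicon concrete, the following minimal sketch measures how steeply cluster sizes fall off with rank. It fits a least-squares line to log cluster size against log rank: a slope near -1 is roughly Zipfian, while a slope near 0 indicates near-uniform cluster sizes of the kind the abstract attributes to K-means. The function name `rank_frequency_slope` and the toy label lists are hypothetical and chosen for illustration only; this is not necessarily the evaluation measure used in the paper, and the toy data are not results from it.

```python
import math
from collections import Counter


def rank_frequency_slope(labels):
    """Least-squares slope of log(cluster size) against log(rank).

    `labels` holds one cluster ID per speech segment. A slope near -1
    suggests a Zipf-like lexicon; a slope near 0 suggests near-uniform
    cluster sizes. (Illustrative measure, assumed for this sketch.)
    """
    # Cluster sizes sorted from most to least frequent.
    sizes = sorted(Counter(labels).values(), reverse=True)
    if len(sizes) < 2:
        raise ValueError("need at least two clusters to fit a slope")

    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(size) for size in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Ordinary least squares: slope = cov(x, y) / var(x).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy cluster assignments (made up for illustration, not data from the paper).
# Ten equally sized clusters, as a uniform clustering might produce.
uniform_labels = [c for c in range(10) for _ in range(12)]
# Ten clusters whose sizes shrink roughly as 1 / rank.
zipf_labels = [c for c in range(10) for _ in range(120 // (c + 1))]

print("uniform slope:", rank_frequency_slope(uniform_labels))
print("zipf-like slope:", rank_frequency_slope(zipf_labels))
```

Applied to the cluster labels produced by different clustering methods on the same segments, a measure like this would give a single number for comparing how closely each discovered lexicon follows a Zipfian rank-frequency curve.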