Short text clustering has gained significant attention in the data mining community. However, the limited information contained in short texts often leads to poorly discriminative representations, which makes clustering more difficult. This paper proposes a novel short text clustering framework, called Reliable \textbf{P}seudo-labeling via \textbf{O}ptimal \textbf{T}ransport with \textbf{A}ttention for Short Text Clustering (\textbf{POTA}), that generates reliable pseudo-labels to aid discriminative representation learning for clustering. Specifically, \textbf{POTA} first implements an instance-level attention mechanism to capture the semantic relationships among samples, which are then incorporated as a regularization term into an optimal transport (OT) problem. By solving this OT problem, we obtain reliable pseudo-labels that simultaneously account for sample-to-sample semantic consistency and sample-to-cluster global structure information. Additionally, the proposed OT formulation adaptively estimates cluster distributions, making \textbf{POTA} well suited to datasets with varying degrees of imbalance. We then use the pseudo-labels to guide contrastive learning, which generates discriminative representations and achieves efficient clustering. Extensive experiments demonstrate that \textbf{POTA} outperforms state-of-the-art methods. The code is available at: \href{https://github.com/YZH0905/POTA-STC/tree/main}{https://github.com/YZH0905/POTA-STC/tree/main}.
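To make the idea of pseudo-labeling via optimal transport concrete, the following is a minimal sketch of the generic entropic OT (Sinkhorn-Knopp) approach, not the POTA formulation itself. It assumes fixed uniform marginals over samples and clusters and omits the attention-based regularization term, whereas the abstract states that POTA estimates cluster distributions adaptively. The function name `sinkhorn_pseudo_labels` and the parameters `epsilon` and `n_iters` are hypothetical and do not come from the authors' code.

```python
import numpy as np


def sinkhorn_pseudo_labels(probs, epsilon=1.0, n_iters=200):
    """Derive pseudo-labels from an entropic OT plan (illustrative only).

    probs: (N, K) array of predicted cluster probabilities, rows sum to 1.
    Assumption: uniform marginals over samples and clusters, which is a
    simplification that the POTA abstract explicitly moves away from.
    """
    n, k = probs.shape
    cost = -np.log(probs + 1e-12)        # low cost where the model is confident
    kernel = np.exp(-cost / epsilon)     # Gibbs kernel of the entropic OT problem
    row_marginal = np.full(n, 1.0 / n)   # each sample carries equal mass
    col_marginal = np.full(k, 1.0 / k)   # assumed balanced clusters
    u = np.ones(n)
    for _ in range(n_iters):
        # Alternate scaling so columns, then rows, match their marginals.
        v = col_marginal / (kernel.T @ u)
        u = row_marginal / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    # The pseudo-label of a sample is the cluster receiving most of its mass.
    return plan.argmax(axis=1), plan


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(12, 3))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    labels, plan = sinkhorn_pseudo_labels(probs)
    print(labels)
```

Because the last scaling step updates the row factors, each row of the plan sums to 1/N, while the column sums only approach 1/K as the iterations converge. Replacing the fixed `col_marginal` with a learned cluster distribution would be one way to move toward the imbalance-aware behavior the abstract describes, though how POTA does this is not specified here.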