Contrastive learning has gained significant attention in short text clustering, yet it has an inherent drawback: samples from the same category may be mistakenly identified as negatives and separated in the feature space (false negative separation), which hinders the generation of superior representations. To generate more discriminative representations for efficient clustering, we propose a novel short text clustering method called Discriminative Representation learning via \textbf{A}ttention-\textbf{E}nhanced \textbf{C}ontrastive \textbf{L}earning for Short Text Clustering (\textbf{AECL}). \textbf{AECL} consists of two modules: a pseudo-label generation module and a contrastive learning module. Both modules build a sample-level attention mechanism to capture similarity relationships between samples and aggregate cross-sample features to generate consistent representations. The former module then uses the more discriminative consistent representations to produce reliable supervision information to assist clustering, while the latter module exploits the similarity relationships and consistent representations to optimize the construction of positive samples and perform similarity-guided contrastive learning, effectively addressing the false negative separation issue. Experimental results demonstrate that the proposed \textbf{AECL} outperforms state-of-the-art methods. If the paper is accepted, we will open-source the code.
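To make the two core ideas concrete, the following is a minimal numpy sketch of (a) sample-level attention that aggregates cross-sample features into consistent representations and (b) using the attention weights to pick likely same-class neighbours as extra positives. All function names and the top-$k$ selection rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sample_level_attention(Z):
    """Hypothetical sketch: each sample's consistent representation is a
    similarity-weighted mixture of all samples in the batch (cross-sample
    feature aggregation). Returns (consistent representations, attention)."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)      # L2-normalise rows
    sim = Z @ Z.T                                         # pairwise cosine similarity
    e = np.exp(sim - sim.max(axis=1, keepdims=True))      # stable row-wise softmax
    attn = e / e.sum(axis=1, keepdims=True)               # attention over samples
    return attn @ Z, attn                                 # aggregated features

def similarity_guided_positives(attn, k=1):
    """Hypothetical sketch of similarity-guided positive construction:
    treat each sample's top-k most-attended neighbours (excluding itself)
    as positives, so likely same-class pairs are not pushed apart as
    false negatives."""
    a = attn.copy()
    np.fill_diagonal(a, -np.inf)                          # never pick self
    return np.argsort(-a, axis=1)[:, :k]                  # neighbour indices
```

Under this sketch, pairs selected by `similarity_guided_positives` would be pulled together by the contrastive loss instead of being repelled, which is one plausible reading of how similarity-guided contrastive learning mitigates false negative separation.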