In this paper, we propose an intuitive, training-free, and label-free method for intent clustering in conversational search. Current approaches to short-text clustering use LLM-generated pseudo-labels to enrich text representations or to identify similar text pairs for pooling. These approaches have two limitations: (1) each text is assigned only a single label, and refining representations toward a single label can be unstable; (2) text-level similarity is treated as a binary selection, which fails to account for continuous degrees of similarity. Our method, LUMI, is designed to amplify similarities between texts by using shared pseudo-labels. We first generate pseudo-labels for each text and collect them into a pseudo-label set. Next, we compute the mean of the pseudo-label embeddings and pool it with the text embedding. Finally, we perform text-level pooling: each text representation is pooled with its similar pairs, where similarity is determined by the degree of shared labels. Our evaluation on four benchmark datasets shows that our approach achieves competitive results, outperforming recent state-of-the-art baselines, while avoiding the need to estimate the number of clusters during embedding refinement, as most methods require. Our findings indicate that LUMI can be applied effectively in unsupervised short-text clustering scenarios.
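The two pooling steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mixing weight `alpha`, the toy embeddings, and the use of Jaccard overlap over pseudo-label sets as the "degree of shared labels" are all assumptions made for the sake of a runnable example.

```python
import numpy as np

def refine_embeddings(text_embs, label_sets, label_embs, alpha=0.5):
    """Refine text embeddings with pseudo-labels (illustrative sketch).

    Step 1: pool each text embedding with the mean of its pseudo-label
    embeddings. Step 2: text-level pooling, where each representation is
    averaged with all others, weighted by the degree of shared labels
    (Jaccard overlap of the pseudo-label sets, an assumed choice here).
    """
    n = len(text_embs)

    # Label-level pooling: mix text embedding with mean pseudo-label embedding.
    pooled = np.empty_like(text_embs)
    for i, labels in enumerate(label_sets):
        label_mean = np.mean([label_embs[l] for l in labels], axis=0)
        pooled[i] = alpha * text_embs[i] + (1 - alpha) * label_mean

    # Text-level pooling: continuous weights from shared-label degree,
    # rather than a binary similar/dissimilar selection.
    refined = np.empty_like(pooled)
    for i in range(n):
        weights = np.array([
            len(label_sets[i] & label_sets[j]) / len(label_sets[i] | label_sets[j])
            for j in range(n)
        ])
        refined[i] = weights @ pooled / weights.sum()
    return refined

# Toy example: three texts, two of which share pseudo-labels.
label_embs = {"refund": np.array([1.0, 0.0]),
              "cancel": np.array([0.9, 0.1]),
              "weather": np.array([0.0, 1.0])}
text_embs = np.array([[1.0, 0.2], [0.8, 0.0], [0.1, 1.0]])
label_sets = [{"refund", "cancel"}, {"refund"}, {"weather"}]

refined = refine_embeddings(text_embs, label_sets, label_embs)
```

Because the weights are continuous, texts sharing more pseudo-labels are pulled closer together, while texts with disjoint label sets keep their label-pooled representation unchanged.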