In this paper, we propose an intuitive, training-free, and label-free method for intent clustering that makes minimal assumptions and relies only on lightweight, open-source LLMs. Many current approaches depend on commercial LLMs, which are costly and offer limited transparency. Moreover, these methods often require the number of clusters to be known in advance, an assumption that rarely holds in realistic settings. To address these challenges, instead of asking the LLM to match similar texts directly, we first ask it to generate pseudo-labels for each text, and then perform multi-label classification over this pseudo-label set for each text. This approach rests on the hypothesis that texts belonging to the same cluster share more labels and are therefore closer when encoded into embeddings. The pseudo-labels are also more human-readable than direct similarity matches. Our evaluation on four benchmark datasets shows that our approach achieves results comparable to or better than recent baselines while remaining simple and computationally efficient. These findings indicate that our method is applicable in low-resource scenarios and is stable across multiple models and datasets.
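The core idea can be illustrated with a minimal, self-contained sketch. The labeling dictionary below is a hypothetical stand-in for the LLM pseudo-labeling step, and set overlap (Jaccard similarity) stands in for the embedding-based distance used in the actual method; both are assumptions for illustration only.

```python
from itertools import combinations

# Hypothetical stand-in for the LLM step: in the real pipeline an LLM
# would generate these pseudo-labels for each input text.
pseudo_labels = {
    "book me a flight to Paris":   {"travel", "booking", "flight"},
    "reserve a plane ticket":      {"travel", "booking", "flight"},
    "what's the weather tomorrow": {"weather", "forecast"},
    "will it rain this weekend":   {"weather", "forecast", "rain"},
}

def jaccard(a, b):
    """Label-set overlap: texts in the same cluster share more labels."""
    return len(a & b) / len(a | b)

def cluster(labels, threshold=0.4):
    """Greedy single-link grouping via union-find: merge any two texts
    whose pseudo-label sets overlap at least `threshold`."""
    texts = list(labels)
    parent = {t: t for t in texts}

    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t

    for a, b in combinations(texts, 2):
        if jaccard(labels[a], labels[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in texts:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())

clusters = cluster(pseudo_labels)
print(len(clusters))  # the four texts collapse into 2 intent clusters
```

Note that the number of clusters falls out of the label overlap rather than being fixed in advance, mirroring the assumption-free setting described above.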