In this paper, we propose a training-free method for unsupervised short text clustering that relies less on careful selection of embedders than other methods. In customer-facing chatbots, companies deal with large volumes of user utterances that must be clustered by intent. In such settings, labeled data is typically unavailable and the number of clusters is unknown. Recent approaches to short-text clustering in label-free settings incorporate LLM output to refine existing embeddings. While LLMs can identify similar texts effectively, the resulting similarities may not be directly represented by distances in the dense vector space, as they depend on the original embedding. We therefore propose a method that transforms LLM judgments directly into a bag-of-texts representation in which texts are initialized to be equidistant, without assuming any prior distance relationships. Our method achieves results comparable or superior to state-of-the-art methods, without optimizing embeddings or assuming prior knowledge of clusters or labels. Experiments on diverse datasets and smaller LLMs show that our method is model-agnostic: it can be applied with any embedder, relatively small LLMs, and different clustering methods. We also show how our method scales to large datasets, reducing the computational cost of LLM use. This flexibility and scalability make our method better aligned with real-world training-free scenarios than existing clustering methods.
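The core idea above can be sketched minimally: initialize each text as a one-hot vector (so all texts are equidistant), then let pairwise LLM judgments pull similar texts together by sharing coordinates. This is only an illustrative sketch under assumed details: `llm_same_intent` is a hypothetical keyword-overlap stand-in for a real LLM prompt, and the example texts are invented; the paper's actual prompting, representation updates, and cost-reduction strategy are not shown here.

```python
import numpy as np

# Hypothetical stand-in for a pairwise LLM judgment; the real method
# would prompt an LLM to decide whether two utterances share an intent.
STOPWORDS = {"my", "is", "where", "the"}

def llm_same_intent(a: str, b: str) -> bool:
    wa = set(a.split()) - STOPWORDS
    wb = set(b.split()) - STOPWORDS
    return bool(wa & wb)

texts = [
    "reset my password",
    "forgot password help",
    "track my order",
    "where is my order",
]

n = len(texts)
# Bag-of-texts initialization: one-hot vectors, so every pair of texts
# starts out equidistant, with no prior distance relationships assumed.
vectors = np.eye(n)

# Fold pairwise LLM judgments into the representation: texts judged
# similar come to share coordinates in the bag-of-texts space.
for i in range(n):
    for j in range(i + 1, n):
        if llm_same_intent(texts[i], texts[j]):
            vectors[i, j] = 1.0
            vectors[j, i] = 1.0

# Any off-the-shelf clustering method can now operate on `vectors`.
```

Note that querying all pairs costs O(n²) LLM calls; the scalability claim in the abstract implies the actual method avoids exhaustive pairwise comparison, which this sketch does not attempt.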