Clustering is a fundamental tool used across a wide range of applications, including text analysis. To improve clustering accuracy, many researchers have incorporated background knowledge, typically in the form of must-link and cannot-link constraints, to guide the clustering process. With the recent advent of large language models (LLMs), there is growing interest in improving clustering quality through LLM-based automatic constraint generation. In this paper, we propose a novel constraint-generation approach that reduces resource consumption by generating constraint sets rather than traditional pairwise constraints. This approach improves both query efficiency and constraint accuracy compared to state-of-the-art methods. We further introduce a constrained clustering algorithm tailored to the characteristics of LLM-generated constraints. The algorithm incorporates a confidence threshold and a penalty mechanism to handle potentially inaccurate constraints. We evaluate our approach on five text datasets, considering both the cost of constraint generation and the overall clustering performance. The results show that our method achieves clustering accuracy comparable to state-of-the-art algorithms while reducing the number of LLM queries by a factor of more than 20.
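The abstract does not spell out how the confidence threshold and penalty mechanism work, so the following is only a minimal sketch of one common way to treat possibly wrong constraints as soft rather than hard. It is not the paper's algorithm. The names `filter_constraints`, `assign_with_penalty`, the `(i, j, is_must_link, confidence)` record layout, and the choice to scale the penalty by confidence are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
# Hypothetical record for one LLM-derived constraint:
# (index_a, index_b, is_must_link, confidence in [0, 1])
Constraint = Tuple[int, int, bool, float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filter_constraints(constraints: Sequence[Constraint],
                       threshold: float) -> List[Constraint]:
    """Keep only constraints whose confidence meets the threshold."""
    return [c for c in constraints if c[3] >= threshold]


def assign_with_penalty(points: Sequence[Point],
                        centers: Sequence[Point],
                        constraints: Sequence[Constraint],
                        labels: Sequence[int],
                        penalty: float) -> List[int]:
    """One assignment pass: each point takes the cluster that minimizes
    squared distance plus a penalty for every constraint it would violate.
    Violations are allowed, just discouraged (assumed soft-constraint form)."""
    labels = list(labels)
    for i, p in enumerate(points):
        best_k, best_cost = labels[i], float("inf")
        for k, center in enumerate(centers):
            cost = sq_dist(p, center)
            for a, b, must, conf in constraints:
                if a == b or i not in (a, b):
                    continue
                other = b if a == i else a
                same = labels[other] == k
                # Violated when must-link ends up split or cannot-link ends up merged.
                if must != same:
                    cost += penalty * conf
            if cost < best_cost:
                best_k, best_cost = k, cost
        labels[i] = best_k
    return labels


points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
centers = [(0.5, 0.0), (4.5, 0.0)]
raw = [(0, 1, False, 0.9),  # confident cannot-link between points 0 and 1
       (1, 2, True, 0.3)]   # low-confidence must-link, dropped by the threshold
kept = filter_constraints(raw, threshold=0.5)
new_labels = assign_with_penalty(points, centers, kept, [0, 0, 1, 1], penalty=20.0)
```

In this toy run the confident cannot-link pushes point 1 out of point 0's cluster, while the low-confidence must-link is discarded before it can influence the assignment. A full method would alternate this pass with center updates until the labels stop changing.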