Text clustering is an important approach for organising the growing volume of digital content, helping to structure uncategorised data and uncover hidden patterns in it. In this research, we investigated how different textual embeddings, particularly those used in large language models (LLMs), and clustering algorithms affect how text datasets are clustered. A series of experiments was conducted to assess how embeddings influence clustering results, the role played by dimensionality reduction through summarisation, and the effect of adjusting embedding size. Results reveal that LLM embeddings excel at capturing the nuances of structured language, while BERT leads the lightweight options in performance. In addition, we find that neither increasing embedding dimensionality nor applying summarisation techniques uniformly improves clustering efficiency, suggesting that these strategies require careful analysis before they are used in real-life models. These results highlight a complex balance between the need for nuanced text representation and computational feasibility in text clustering applications. This study extends traditional text clustering frameworks by incorporating embeddings from LLMs, thereby paving the way for improved methodologies and opening new avenues for future research in various types of textual analysis.
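The pipeline summarised in the abstract (embed each text, then cluster the embeddings) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the study's method: the bag-of-words `embed` function is a stand-in for the LLM or BERT embeddings, plain k-means is a stand-in for whichever clustering algorithms were evaluated, and every function name and sample text here is hypothetical.

```python
import math
from collections import Counter


def embed(texts):
    """Toy stand-in embedding (assumption): L2-normalised bag-of-words counts."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in texts:
        vec = [0.0] * len(vocab)
        for word, count in Counter(t.lower().split()).items():
            vec[index[word]] = float(count)
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest existing centroid.
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_texts(texts, k):
    """Embed the texts, then cluster the embeddings into k groups."""
    return kmeans(embed(texts), k)


# Hypothetical sample data: two texts about pets, two about finance.
texts = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock markets fell sharply today",
    "stock markets rallied today",
]
labels = cluster_texts(texts, 2)
print(labels)
```

Swapping `embed` for a function that returns vectors from another model leaves the rest of the sketch unchanged, which is the kind of comparison across embeddings that the abstract describes.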