Recent advances in machine learning, particularly Large Language Models (LLMs) such as BERT and GPT, provide rich contextual embeddings that improve text representation. However, current document clustering approaches often ignore the deeper relationships between named entities (NEs) and the potential of LLM embeddings. This paper proposes a novel approach that integrates Named Entity Recognition (NER) and LLM embeddings within a graph-based framework for document clustering. The method builds a graph whose nodes represent documents and whose edges are weighted by named-entity similarity, then refines it with a graph convolutional network (GCN). This yields more effective grouping of semantically related documents. Experimental results indicate that our approach outperforms conventional co-occurrence-based methods in clustering quality, notably for documents rich in named entities.
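The graph-construction step described above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: it assumes each document has already been reduced to a set of named entities by an NER tagger, and it uses Jaccard similarity as one plausible choice of entity-similarity measure for edge weights; the GCN refinement step is omitted.

```python
def entity_jaccard(a, b):
    """Jaccard similarity between two sets of named entities
    (one plausible edge-weighting choice; the paper may differ)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def build_document_graph(doc_entities, threshold=0.1):
    """Build weighted edges (doc_i, doc_j, w) between documents whose
    named-entity similarity exceeds `threshold`.

    doc_entities: dict mapping document id -> set of named entities,
    as would be produced by an upstream NER step (assumed here).
    """
    edges = []
    docs = list(doc_entities.items())
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            w = entity_jaccard(docs[i][1], docs[j][1])
            if w > threshold:
                edges.append((docs[i][0], docs[j][0], w))
    return edges

# Toy example with hypothetical entity sets from an NER tagger.
docs = {
    "d1": {"BERT", "GPT", "OpenAI"},
    "d2": {"BERT", "Google"},
    "d3": {"Paris", "France"},
}
print(build_document_graph(docs))  # → [('d1', 'd2', 0.25)]
```

In the full method, the resulting weighted graph (together with LLM embeddings as node features) would be passed to a GCN, whose learned node representations are then clustered; only the edge-weighting stage is shown here.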