NLP pipelines with limited or no labeled data rely on unsupervised methods for document processing. Unsupervised approaches typically depend on clustering of terms or documents. In this paper, we introduce a novel clustering algorithm, Vec2GC (Vector to Graph Communities), an end-to-end pipeline to cluster terms or documents for any given text corpus. Our method uses community detection on a weighted graph of the terms or documents, created using text representation learning. The Vec2GC clustering algorithm is a density-based approach that also supports hierarchical clustering.
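The general idea of going from vectors to graph communities can be illustrated with a minimal sketch. It is not the Vec2GC implementation: the abstract does not specify how edges are weighted or which community detection algorithm is used, so the cosine-similarity threshold, the use of similarity as the edge weight, and the weighted label propagation step below are all assumptions chosen to keep the example dependency-free. The toy 2-D vectors stand in for learned term or document embeddings, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_graph(vectors, threshold=0.8):
    """Build a weighted graph: one node per vector, one edge per similar pair.

    ASSUMPTION: edges connect pairs with cosine similarity >= threshold and
    carry the similarity as their weight. The source does not state this rule.
    """
    graph = defaultdict(dict)
    for i in range(len(vectors)):
        graph[i]  # touch the key so isolated nodes still appear in the graph
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                graph[i][j] = sim
                graph[j][i] = sim
    return graph


def label_propagation(graph, max_iters=20):
    """Weighted label propagation, used here as a stand-in community detector.

    ASSUMPTION: any community detection method could be substituted here.
    Nodes are visited in sorted order and ties go to the smallest label,
    which keeps the result deterministic.
    """
    labels = {node: node for node in graph}
    for _ in range(max_iters):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated node keeps its own label
            score = defaultdict(float)
            for neighbor, weight in graph[node].items():
                score[labels[neighbor]] += weight
            best = min(score, key=lambda label: (-score[label], label))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    groups = defaultdict(list)
    for node, label in labels.items():
        groups[label].append(node)
    return sorted((sorted(members) for members in groups.values()), key=min)


def cluster_vectors(vectors, threshold=0.8):
    """Vectors -> weighted similarity graph -> communities (node index lists)."""
    return label_propagation(build_graph(vectors, threshold))


# Toy embeddings: two tight groups and one vector far from both.
toy_vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
    [0.0, 1.0], [0.1, 0.9],
    [-1.0, -1.0],
]
communities = cluster_vectors(toy_vectors)
print(communities)
```

In this sketch the vector that has no sufficiently similar neighbor ends up in a community of its own, which is one simple way a graph-based approach can separate dense regions from sparse ones.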