Topic modeling uncovers hidden semantic structure in text and produces meaningful descriptive keywords. Techniques such as BERTopic and Top2Vec have recently come to the forefront, but they have limitations. Our analysis indicates that these methods may not prioritize refining their clustering mechanism, which can compromise the quality of the resulting topic clusters. For example, Top2Vec uses the centroids of its clustering results to represent topics, whereas BERTopic relies on C-TF-IDF for topic extraction. In response, we introduce "TF-RDF" (Term Frequency - Relative Document Frequency), an approach for assessing the relevance of terms within a document. Building on TF-RDF, we present MPTopic, a clustering algorithm driven by TF-RDF. In a comprehensive evaluation, the topic keywords identified by MPTopic together with TF-RDF outperform those extracted by both BERTopic and Top2Vec.