Topic modelling is a prominent approach to automatic topic extraction in applications such as sentiment analysis and recommendation systems, and it is vital for service industries that need to monitor customer discussions. Traditional approaches such as Latent Dirichlet Allocation (LDA) have performed well in topic discovery; however, their results are inconsistent because they suffer from data sparseness and cannot model word order in a document. This study therefore presents the use of Kernel Principal Component Analysis (KernelPCA) and K-means clustering in the BERTopic architecture. We prepared a new dataset of tweets from customers of Nigerian banks and use it to compare the topic modelling approaches. Our findings show that KernelPCA and K-means in the BERTopic architecture produced coherent topics, with a coherence score of 0.8463.
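As a rough illustration of the reduce-then-cluster idea, the sketch below applies scikit-learn's `KernelPCA` and `KMeans` to a handful of invented example sentences. It is a minimal sketch under stated assumptions, not the study's implementation: the documents are hypothetical, TF-IDF vectors stand in for the transformer embeddings that BERTopic would normally compute, and the RBF kernel, the two components, and the three clusters are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import KernelPCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents; the study's tweet dataset is not reproduced here.
docs = [
    "my debit card was declined at the atm",
    "atm swallowed my debit card again",
    "the mobile app keeps crashing on login",
    "cannot login to the mobile app today",
    "transfer failed but my account was debited",
    "account debited twice for one transfer",
]

# Assumption: TF-IDF vectors stand in for sentence embeddings.
embeddings = TfidfVectorizer().fit_transform(docs).toarray()

# Nonlinear dimensionality reduction with Kernel PCA (illustrative settings).
reducer = KernelPCA(n_components=2, kernel="rbf")
reduced = reducer.fit_transform(embeddings)

# Partition the reduced vectors into a fixed number of clusters (candidate topics).
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = clusterer.fit_predict(reduced)

# Show which documents ended up in each cluster.
for cluster_id in sorted(set(labels)):
    members = [doc for doc, label in zip(docs, labels) if label == cluster_id]
    print(cluster_id, members)
```

In BERTopic itself, scikit-learn models of this kind can, to our understanding, be supplied through the `umap_model` and `hdbscan_model` constructor arguments in place of the default UMAP and HDBSCAN components; the exact configuration used in the study is not stated here.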