As the amount of text data generated by humans and machines grows, understanding large corpora and extracting insights from them is becoming more crucial than ever. Dynamic topic models are effective methods for studying how the topics present in a collection of documents evolve. These models are widely used for understanding trends, exploring public opinion in social networks, and tracking research progress and discoveries in scientific archives. Since topics are defined as clusters of semantically similar documents, understanding how topics evolve as new knowledge is discovered requires observing how the content or themes of these clusters change over time. In this paper, we introduce the Aligned Neural Topic Model (ANTM), a dynamic neural topic model that uses document embeddings to compute clusters of semantically similar documents in different periods and aligns these document clusters to represent their evolution. This alignment procedure preserves the temporal similarity of document clusters over time and captures the semantic change of words, as characterized by their context within different periods. Experiments on four different datasets show that ANTM outperforms probabilistic dynamic topic models (e.g., DTM, DETM) and significantly improves topic coherence and diversity over other existing dynamic neural topic models (e.g., BERTopic).
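The general idea of clustering documents per period and then aligning the clusters across periods can be illustrated with a minimal sketch. This is not ANTM's actual procedure, which the abstract does not specify in detail: here each cluster is summarized by the centroid of its document embeddings, and clusters in consecutive periods are linked when the cosine similarity of their centroids reaches a threshold. The function names (`cosine`, `centroid`, `align_clusters`), the input layout, and the threshold value of 0.8 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def align_clusters(periods, threshold=0.8):
    """Link clusters across consecutive periods (hypothetical helper).

    periods: list with one dict per time period, mapping a cluster id to the
             list of document embeddings assigned to that cluster.
    Returns a list of chains; each chain is a list of (period_index, cluster_id)
    pairs describing one evolving topic. If two clusters in one period both
    match the same earlier cluster, both are appended to the same chain.
    """
    chains = []
    open_chains = {}     # cluster id in the previous period -> its chain
    prev_centroids = {}  # cluster id in the previous period -> its centroid
    for t, clusters in enumerate(periods):
        centroids = {cid: centroid(vecs) for cid, vecs in clusters.items()}
        next_open = {}
        for cid, c in centroids.items():
            # Find the most similar cluster in the previous period, if any
            # reaches the similarity threshold.
            best_id, best_sim = None, threshold
            for pid, pc in prev_centroids.items():
                sim = cosine(c, pc)
                if sim >= best_sim:
                    best_id, best_sim = pid, sim
            if best_id is not None:
                chain = open_chains[best_id]  # continue an existing topic
            else:
                chain = []                    # start a new topic
                chains.append(chain)
            chain.append((t, cid))
            next_open[cid] = chain
        open_chains, prev_centroids = next_open, centroids
    return chains


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" for two periods (made-up values).
    toy_periods = [
        {"a": [[1.0, 0.0], [0.9, 0.1]], "b": [[0.0, 1.0]]},
        {"x": [[0.95, 0.05]], "y": [[-1.0, 0.0]]},
    ]
    for chain in align_clusters(toy_periods):
        print(chain)
```

In this toy input, cluster `x` in the second period continues the topic started by cluster `a`, while `b` and `y` each form their own single-period topic. In practice the embeddings would come from a document encoder and the per-period clusters from a clustering algorithm; both steps are outside this sketch.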