Topic modeling is a key component of unsupervised learning, used to identify topics within a corpus of text. The rapid growth of social media produces an ever-increasing volume of text every day, making online topic modeling methods essential for handling data streams that arrive continuously over time. This paper introduces StreamETM, a novel approach to online topic modeling. It builds on the Embedded Topic Model (ETM) and handles data streams by merging models learned on consecutive partial document batches using unbalanced optimal transport. In addition, an online change point detection algorithm identifies shifts in topics over time, revealing significant changes in the dynamics of text streams. Numerical experiments on simulated and real-world data show that StreamETM outperforms competing methods. The code is publicly available at https://github.com/fgranese/StreamETM.
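To make the merging step concrete, the following is a minimal sketch of how topic embeddings from two consecutive batches could be aligned with entropic unbalanced optimal transport. It is an illustration under stated assumptions, not the StreamETM implementation: the function names `unbalanced_sinkhorn` and `merge_topics`, the squared Euclidean cost, the uniform topic weights, the averaging rule, and the `novelty_threshold` parameter are all hypothetical choices made for this example. The solver uses the standard scaling iterations for KL-relaxed marginals and is not log-domain stabilized.

```python
import numpy as np


def unbalanced_sinkhorn(a, b, M, reg=0.1, reg_m=1.0, n_iter=500):
    """Entropic unbalanced OT with KL-relaxed marginals (scaling iterations).

    a, b  : nonnegative weights of the source and target topics
    M     : cost matrix of shape (len(a), len(b))
    reg   : entropic regularization strength
    reg_m : marginal relaxation strength (larger means closer to balanced OT)
    """
    K = np.exp(-M / reg)                 # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    fi = reg_m / (reg_m + reg)           # exponent that softens the marginal constraints
    tiny = 1e-300                        # guard against division by zero
    for _ in range(n_iter):
        u = (a / (K @ v + tiny)) ** fi
        v = (b / (K.T @ u + tiny)) ** fi
    return u[:, None] * K * v[None, :]   # transport plan


def merge_topics(prev_topics, new_topics, reg=0.1, reg_m=1.0, novelty_threshold=1e-2):
    """Illustrative merge rule (an assumption, not the paper's rule).

    Previous topics are averaged with their transported counterparts, and new
    topics that receive almost no mass are appended as novel topics.
    """
    # Squared Euclidean cost between every previous and every new topic embedding.
    diff = prev_topics[:, None, :] - new_topics[None, :, :]
    M = (diff ** 2).sum(axis=-1)

    a = np.full(len(prev_topics), 1.0 / len(prev_topics))
    b = np.full(len(new_topics), 1.0 / len(new_topics))
    plan = unbalanced_sinkhorn(a, b, M, reg=reg, reg_m=reg_m)

    # Barycentric projection: where does each previous topic land among the new ones?
    row_mass = plan.sum(axis=1, keepdims=True)
    matched = (plan @ new_topics) / np.maximum(row_mass, 1e-300)
    merged_prev = 0.5 * (prev_topics + matched)

    # New topics with negligible incoming mass have no counterpart in the past.
    col_mass = plan.sum(axis=0)
    novel = new_topics[col_mass < novelty_threshold * b]
    return np.vstack([merged_prev, novel]), plan


if __name__ == "__main__":
    prev = np.array([[0.0, 0.0], [1.0, 0.0]])
    new = np.array([[0.0, 0.05], [1.0, 0.05], [3.0, 3.0]])
    merged, plan = merge_topics(prev, new)
    print(merged)   # two updated topics followed by one appended novel topic
```

Because the marginals are only softly enforced, a new topic that is far from every previous topic can receive almost no mass instead of being forced onto an unrelated match. This is the property that makes unbalanced transport a natural fit when topics appear or disappear between batches.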