Topic models and their variants analyse text by learning meaningful representations from word co-occurrences. As Williamson et al. (2010) pointed out, such models implicitly assume that the probability that a topic is active and its proportion within each document are positively correlated. This correlation can be strongly detrimental for documents created over time, because recent documents are likely to be better described by new, and hence rare, topics. In this work we leverage recent advances in neural variational inference and present an alternative neural approach to the dynamic Focused Topic Model. Specifically, we develop a neural model for topic evolution that uses sequences of Bernoulli random variables to track the appearance of topics, thereby decoupling their activity from their proportions. We evaluate our model on three datasets (the UN general debates, the collection of NeurIPS papers, and the ACL Anthology dataset) and show that it (i) outperforms state-of-the-art topic models on generalization tasks and (ii) performs comparably to them on prediction tasks, while using roughly the same number of parameters and converging about two times faster. Source code to reproduce our experiments is available online.
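The decoupling idea can be illustrated with a minimal sketch. It is not the model described in this work: it omits the temporal dynamics and the neural variational inference entirely, and the function `focused_proportions`, the fallback rule for an empty mask, and all numeric values are assumptions made for illustration. It only shows how a Bernoulli on/off mask lets a rarely active topic still receive a large proportion whenever it is switched on.

```python
import math
import random


def focused_proportions(activity_probs, scores, rng):
    """Draw one Bernoulli on/off variable per topic, then normalise
    exp(score) over the active topics only (inactive topics get 0)."""
    mask = [1 if rng.random() < p else 0 for p in activity_probs]
    if not any(mask):
        # Assumption of this sketch: if no topic was drawn, switch on the
        # most probable one so the proportions are well defined.
        best = max(range(len(mask)), key=lambda k: activity_probs[k])
        mask[best] = 1
    weights = [m * math.exp(s) for m, s in zip(mask, scores)]
    total = sum(weights)
    return mask, [w / total for w in weights]


rng = random.Random(0)
activity_probs = [0.9, 0.9, 0.05]  # hypothetical: topic 2 is rarely active
scores = [0.0, 0.0, 3.0]           # hypothetical: but dominant when active
mask, theta = focused_proportions(activity_probs, scores, rng)
print(mask, theta)
```

Because the mask and the scores are separate inputs, a topic's chance of appearing (`activity_probs`) places no constraint on how much of the document it explains once it does appear (`scores`).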