Topic modelling is fundamentally a soft clustering problem: known objects (documents) are grouped into unknown clusters (topics). As such, the task is ill-posed; in particular, topic models are unstable and incomplete. As a consequence, finding a good topic model (through repeated hyperparameter selection, model training, and topic quality assessment) can be a long and labor-intensive process. We aim to simplify this process and to make it more deterministic and provable. To this end, we present a method for the iterative training of a topic model. The essence of the method is that a series of related topic models is trained so that each subsequent model is at least as good as the previous one, i.e., it retains all the good topics found earlier. The connection between successive models is achieved through additive regularization. The result of this iterative training is the last model in the series, which we call the iteratively updated additively regularized topic model (ITAR). Experiments on several collections of natural-language texts show that the proposed ITAR model outperforms other popular topic models (LDA, ARTM, BERTopic), that its topics are diverse, and that its perplexity (its ability to "explain" the underlying data) is moderate.
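To make the training scheme concrete, below is a minimal numpy sketch of one plausible instantiation, not the authors' implementation: a PLSA-style EM in which an additive term with weight tau anchors previously found topics while fresh topics are retrained. The em_fit and good_topics helpers, the top-word-diversity criterion, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def em_fit(n_dw, T, n_iters=30, phi_fixed=None, tau=0.0):
    """PLSA-style EM over document-term counts n_dw (D x W) with T topics.
    If phi_fixed (W x k) is given, its k topics warm-start the first k
    columns of Phi, and the additive term tau * phi_fixed keeps them in place."""
    D, W = n_dw.shape
    N = n_dw.T                                       # W x D term-document counts
    phi = rng.random((W, T)); phi /= phi.sum(axis=0)
    theta = rng.random((T, D)); theta /= theta.sum(axis=0)
    k = 0 if phi_fixed is None else phi_fixed.shape[1]
    if k:
        phi[:, :k] = phi_fixed
    for _ in range(n_iters):
        Z = np.maximum(phi @ theta, 1e-12)           # W x D, model p(w|d) up to norm
        R = N / Z
        n_wt = phi * (R @ theta.T)                   # expected topic-word counts
        n_td = theta * (phi.T @ R)                   # expected doc-topic counts
        if k:
            n_wt[:, :k] += tau * phi_fixed           # additive regularizer: anchor old topics
        phi = n_wt / np.maximum(n_wt.sum(axis=0), 1e-12)
        theta = n_td / np.maximum(n_td.sum(axis=0), 1e-12)
    return phi, theta

def good_topics(phi, top=10, max_overlap=1):
    """Toy 'good topic' criterion (an assumption, not the paper's): keep a
    topic if its top-word set shares at most max_overlap words with the
    top-word set of every topic kept before it."""
    kept = []
    tops = [set(np.argsort(phi[:, t])[-top:]) for t in range(phi.shape[1])]
    for t, s in enumerate(tops):
        if all(len(s & tops[j]) <= max_overlap for j in kept):
            kept.append(t)
    return kept

# Iterative scheme: train, keep the good topics, add fresh topics, retrain
# while the regularizer holds the kept topics in place -- so each model in
# the series retains everything good found so far.
D, W, T_new = 200, 500, 10
n_dw = rng.poisson(0.05, size=(D, W)).astype(float)  # toy random corpus
phi_kept = None
for step in range(3):
    T = (0 if phi_kept is None else phi_kept.shape[1]) + T_new
    phi, _ = em_fit(n_dw, T, phi_fixed=phi_kept, tau=100.0)
    phi_kept = phi[:, good_topics(phi)]
    print(f"pass {step}: {phi_kept.shape[1]} good topics retained")
```

In the full ARTM setting the same effect is obtained more generally: the log-likelihood is maximized together with a weighted sum of regularizers (e.g., smoothing toward the previous Phi, sparsing, decorrelation), of which the single anchoring term above is only the simplest example.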