Cross-lingual topic models have been prevalent in cross-lingual text analysis because they reveal aligned latent topics. However, most existing methods suffer from two issues: they produce repetitive topics that hinder further analysis, and their performance declines when the given dictionary has low coverage. In this paper, we propose Cross-lingual Topic Modeling with Mutual Information (InfoCTM). Instead of the direct alignment used in previous work, we propose a mutual information based topic alignment method. It works as a regularization that properly aligns topics and prevents degenerate topic representations of words, mitigating the repetitive topic issue. To address the low-coverage dictionary issue, we further propose a cross-lingual vocabulary linking method that finds more linked cross-lingual words for topic alignment beyond the translations in a given dictionary. Extensive experiments on English, Chinese, and Japanese datasets demonstrate that our method outperforms state-of-the-art baselines, producing more coherent, diverse, and well-aligned topics and showing better transferability to cross-lingual classification tasks.
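To make the alignment idea concrete, the sketch below shows one common way to turn mutual information maximization between linked cross-lingual words into a trainable regularizer: an InfoNCE-style contrastive bound over their topic representations. This is a minimal illustration under assumptions, not the paper's exact objective; the tensor names (`topic_repr_en`, `topic_repr_cn`) and the `temperature` parameter are hypothetical.

```python
# Minimal sketch (assumed formulation): an InfoNCE-style lower bound on the
# mutual information between topic representations of linked cross-lingual
# word pairs, usable as an alignment regularizer added to the topic model loss.
import torch
import torch.nn.functional as F

def mi_alignment_loss(topic_repr_en: torch.Tensor,
                      topic_repr_cn: torch.Tensor,
                      temperature: float = 0.1) -> torch.Tensor:
    """topic_repr_en, topic_repr_cn: (num_pairs, num_topics) tensors where
    row i of each tensor corresponds to one linked cross-lingual word pair."""
    en = F.normalize(topic_repr_en, dim=-1)
    cn = F.normalize(topic_repr_cn, dim=-1)
    # Pairwise similarities: diagonal entries are linked (positive) pairs,
    # off-diagonal entries serve as negatives.
    logits = en @ cn.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Minimizing this cross-entropy maximizes the InfoNCE bound, pulling
    # linked pairs together and pushing unlinked words apart, which
    # discourages collapsed (degenerate) topic representations.
    return F.cross_entropy(logits, targets)
```

In practice such a term would be weighted and added to the base topic modeling objective; the contrastive negatives are what distinguish this from a plain direct-alignment loss that only pulls translation pairs together.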