Cross-lingual topic modeling aims to uncover shared semantic themes across languages. Several methods, both traditional and neural, have been proposed for this problem. While previous methods have achieved some improvements in topic diversity, they often struggle to ensure high topic coherence and consistent alignment across languages. We propose XTRA (Cross-Lingual Topic Modeling with Topic and Representation Alignments), a novel framework that unifies Bag-of-Words modeling with multilingual embeddings. XTRA introduces two core components: (1) representation alignment, which aligns document-topic distributions via contrastive learning in a shared semantic space; and (2) topic alignment, which projects topic-word distributions into the same space to enforce cross-lingual consistency. This dual mechanism enables XTRA to learn topics that are interpretable (coherent and diverse) and well aligned across languages. Experiments on multilingual corpora confirm that XTRA significantly outperforms strong baselines in topic coherence, diversity, and alignment quality. Code and reproducible scripts are available at https://github.com/tienphat140205/XTRA.
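The representation-alignment component described above can be illustrated with a contrastive (InfoNCE-style) objective over paired document representations. The sketch below is a minimal illustration, not the paper's implementation: the function name, the two-language batch setup, and the temperature value are all assumptions, and it treats row `i` of each matrix as a translation pair (the positive), with all other rows serving as in-batch negatives.

```python
import numpy as np

def info_nce_loss(z_lang_a, z_lang_b, temperature=0.1):
    """Illustrative contrastive loss aligning paired document
    representations across two languages in a shared space.

    z_lang_a[i] and z_lang_b[i] are assumed to represent the same
    document in two languages (a positive pair); every other row
    in the batch acts as a negative. Hypothetical sketch, not XTRA's
    actual objective.
    """
    # L2-normalize rows so the dot product is cosine similarity.
    z_lang_a = z_lang_a / np.linalg.norm(z_lang_a, axis=1, keepdims=True)
    z_lang_b = z_lang_b / np.linalg.norm(z_lang_b, axis=1, keepdims=True)
    sim = z_lang_a @ z_lang_b.T / temperature  # (n, n) similarity matrix
    # Softmax cross-entropy with the diagonal entries as positives.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Under this kind of objective, correctly paired cross-lingual documents are pulled together in the shared space while mismatched pairs are pushed apart, which is the intuition behind aligning document-topic distributions across languages.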