Recent clustering-based topic models perform well at monolingual topic identification by introducing a pipeline that clusters contextualized representations. However, this pipeline is suboptimal for identifying topics across languages because multilingual language models generate language-dependent dimensions (LDDs). To address this issue, we introduce a novel SVD-based dimension refinement component into the pipeline of the clustering-based topic model. This component neutralizes the negative impact of LDDs, enabling the model to identify topics accurately across languages. Our experiments on three datasets demonstrate that the updated pipeline with the dimension refinement component generally outperforms other state-of-the-art cross-lingual topic models.
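To make the idea of SVD-based dimension refinement concrete, the following is a minimal sketch rather than the method evaluated in the paper. The abstract does not specify how LDDs are detected, so the sketch assumes that language labels are available and scores each singular direction by how much of its variance is explained by language identity. The names `refine_dimensions`, `centroid_gap`, and `n_drop`, as well as the synthetic data, are hypothetical and used only for illustration.

```python
import numpy as np


def refine_dimensions(embeddings, languages, n_drop=1):
    """Project out the n_drop singular directions most tied to language.

    Assumption (not from the paper): a direction is treated as
    language-dependent when the variance of its per-language means is
    large relative to its total variance.
    """
    X = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions in the embedding space.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ vt.T  # coordinates of each document along each direction
    langs = np.asarray(languages)
    lang_means = np.stack(
        [scores[langs == lang].mean(axis=0) for lang in np.unique(langs)]
    )
    dependence = lang_means.var(axis=0) / (scores.var(axis=0) + 1e-12)
    drop = np.argsort(dependence)[::-1][:n_drop]
    basis = vt[drop]  # shape (n_drop, d)
    # Remove the component of every embedding that lies in the dropped span.
    return X - (X @ basis.T) @ basis


def centroid_gap(X, labels):
    """Distance between the centroids of the two groups in `labels`."""
    a, b = np.unique(labels)
    return np.linalg.norm(X[labels == a].mean(axis=0) - X[labels == b].mean(axis=0))


# Synthetic stand-in for multilingual embeddings: dimension 0 carries a
# topic signal, dimension 1 carries a language offset (the "LDD").
rng = np.random.default_rng(0)
n, d = 200, 8
topic = rng.integers(0, 2, size=n)
lang = np.where(rng.integers(0, 2, size=n) == 1, "zh", "en")
X = rng.normal(scale=0.1, size=(n, d))
X[:, 0] += 3.0 * topic
X[:, 1] += 5.0 * (lang == "zh")

refined = refine_dimensions(X, lang, n_drop=1)

lang_gap_before = centroid_gap(X, lang)
lang_gap_after = centroid_gap(refined, lang)
topic_gap_after = centroid_gap(refined, topic)
print(lang_gap_before, lang_gap_after, topic_gap_after)
```

In this toy setting the language centroids move much closer together after refinement while the topic centroids stay well separated, which is the property a downstream clustering step would rely on. Real multilingual embeddings may spread language information over several directions, so `n_drop` would need tuning.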