Research on Multilingual News Clustering Based on Cross-Language Word Embeddings

Grouping reports of the same event published in different countries is important for public opinion control and intelligence gathering. Because news is so varied, relying solely on human translators is costly and inefficient, while relying solely on machine translation systems incurs considerable overhead from invoking translation interfaces and storing the translated text. To address this, we focus on the problem of clustering cross-lingual news. Specifically, we represent a news article by combining a sentence vector of its headline, embedded in a mixed semantic space, with the topic probability distribution of its content. To train the cross-lingual model, we use knowledge distillation to fit two semantic spaces into one mixed semantic space. We abandon traditional static clustering methods such as K-Means and AGNES in favor of the incremental clustering algorithm Single-Pass, which we further modify to better suit cross-lingual news clustering. Our main contributions are as follows: (1) We adopt the standard English BERT as the teacher model and XLM-RoBERTa as the student model, and train through knowledge distillation a cross-lingual model that can represent sentence-level text in both Chinese and English. (2) We use the LDA topic model to represent a news article as a combination of a cross-lingual headline vector and a content topic probability distribution, and introduce concepts such as topic similarity to address the cross-lingual issue in representing news content. (3) We adapt the Single-Pass clustering algorithm to the news setting: we adjust the distance computation between samples and clusters, add a cluster merging operation, and incorporate a news time parameter.

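
The modified Single-Pass procedure summarized in contribution (3) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the names `Article`, `Cluster`, `similarity`, and `single_pass` are hypothetical, the sample-to-cluster comparison uses cluster centroids, topic similarity is approximated by cosine similarity between topic distributions, the weight `alpha` and both thresholds are arbitrary, and the time parameter is modeled as a maximum gap in days.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


@dataclass
class Article:
    title_vec: list   # headline embedding from the cross-lingual model
    topic_dist: list  # LDA topic probabilities of the news content
    day: float        # publication time in days (assumed unit)


@dataclass
class Cluster:
    members: list = field(default_factory=list)

    def title_centroid(self):
        return mean([m.title_vec for m in self.members])

    def topic_centroid(self):
        return mean([m.topic_dist for m in self.members])

    def latest_day(self):
        return max(m.day for m in self.members)


def similarity(title_a, topic_a, title_b, topic_b, alpha=0.5):
    """Assumed combination: weighted sum of headline and topic cosine similarity."""
    return alpha * cosine(title_a, title_b) + (1 - alpha) * cosine(topic_a, topic_b)


def single_pass(articles, threshold=0.8, max_gap_days=3.0, merge_threshold=0.9):
    clusters = []
    # Incremental pass: each article is seen once, in publication order.
    for art in sorted(articles, key=lambda a: a.day):
        best, best_sim = None, threshold
        for c in clusters:
            # Time parameter: clusters with no recent article are not candidates.
            if art.day - c.latest_day() > max_gap_days:
                continue
            sim = similarity(art.title_vec, art.topic_dist,
                             c.title_centroid(), c.topic_centroid())
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:  # nothing similar enough: open a new cluster
            best = Cluster()
            clusters.append(best)
        best.members.append(art)

    # Cluster merging: fold together clusters whose centroids ended up close
    # and whose latest articles are close in time.
    merged = []
    for c in clusters:
        for m in merged:
            close_in_time = abs(c.latest_day() - m.latest_day()) <= max_gap_days
            sim = similarity(c.title_centroid(), c.topic_centroid(),
                             m.title_centroid(), m.topic_centroid())
            if close_in_time and sim >= merge_threshold:
                m.members.extend(c.members)
                break
        else:
            merged.append(c)
    return merged
```

In this sketch an article joins the most similar recent cluster or starts a new one, so the number of clusters does not have to be fixed in advance as it does with K-Means. In practice the headline vectors would come from the distilled cross-lingual model and the topic distributions from LDA; here they are plain lists so that the example stays self-contained.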