CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English

In the era of social networks and rapidly spreading misinformation, news analysis remains a critical task. Detecting fake news across multiple languages, particularly beyond English, poses significant challenges. Cross-lingual news comparison offers a promising way to verify information by leveraging external sources in different languages (Chen and Shu, 2024). However, existing datasets for cross-lingual news analysis (Chen et al., 2022a) were manually curated by journalists and experts, which limits their scalability and adaptability to new languages. In this work, we address this gap by introducing a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment. Using this pipeline, we collected a novel dataset, CrossNews-UA, of news pairs with Ukrainian as the central language, paired with linguistically and contextually relevant languages: Polish, Russian, and English. Each news pair is annotated for semantic similarity, with detailed justifications based on the 4W criteria (Who, What, Where, When). We further tested a range of models, from traditional bag-of-words models and Transformer-based architectures to large language models (LLMs). Our results highlight the challenges in multilingual news analysis and offer insights into model performance.
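To make the annotation scheme more concrete, the following is a minimal Python sketch of how one crowdsourced judgment with a 4W justification might be represented and combined across workers. The abstract does not specify the label scale, the record schema, or the aggregation rule, so the `Annotation` class, the two-value label set, and the majority-vote `aggregate` function are all illustrative assumptions rather than the pipeline described in the paper.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label set; the abstract does not state the actual scale.
LABELS = ("similar", "dissimilar")
FOUR_W = ("who", "what", "where", "when")


@dataclass(frozen=True)
class Annotation:
    """One crowd worker's judgment for a news pair (illustrative schema)."""

    label: str                # overall similarity judgment
    matches: dict[str, bool]  # per-criterion 4W justification

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if set(self.matches) != set(FOUR_W):
            raise ValueError("matches must cover exactly who/what/where/when")


def aggregate(annotations: list[Annotation]) -> dict:
    """Majority-vote the overall label and each 4W criterion (assumed rule)."""
    if not annotations:
        raise ValueError("need at least one annotation")
    # Most frequent overall label across workers.
    label = Counter(a.label for a in annotations).most_common(1)[0][0]
    # A criterion counts as matching only if a strict majority says so.
    criteria = {
        w: sum(a.matches[w] for a in annotations) * 2 > len(annotations)
        for w in FOUR_W
    }
    return {"label": label, "criteria": criteria}
```

Calling `aggregate` on a list of `Annotation` objects for the same news pair returns a dictionary with the majority label and a per-criterion boolean. Using an odd number of workers per pair avoids ties in this simplified scheme.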
