This work introduces EUvsDisinfo, a multilingual dataset of disinformation and trustworthy articles related to pro-Kremlin themes. It is sourced directly from the debunk articles written by the experts who lead the EUvsDisinfo project. Our dataset is the largest resource to date in terms of both the overall number of articles and the number of distinct languages. It also offers the broadest topical and temporal coverage. Using this dataset, we investigate how pro-Kremlin disinformation spreads across languages, uncovering language-specific patterns in which particular disinformation topics are targeted. We further analyse how the topic distribution evolves over an eight-year period, noting a significant surge in disinformation content before the full-scale invasion of Ukraine in 2022. Lastly, we demonstrate that the dataset can be used to train models that effectively distinguish between disinformation and trustworthy content in multilingual settings.