Basque and Spanish Counter Narrative Generation: Data Creation and Evaluation

Counter Narratives (CNs) are non-negative textual responses to Hate Speech (HS) that aim to defuse online hatred and limit its spread across media. Despite the recent increase in HS content posted online, research on automatic CN generation has been relatively scarce and has focused predominantly on English. In this paper, we present CONAN-EUS, a new Basque and Spanish dataset for CN generation developed by means of Machine Translation (MT) and professional post-editing. Because it is a parallel corpus, also with respect to the original English CONAN, it enables new research on multilingual and crosslingual automatic generation of CNs. Our experiments on CN generation with mT5, a multilingual encoder-decoder model, show that generation benefits greatly from training on post-edited data, as opposed to relying on silver MT data only. These results correlate with a qualitative manual evaluation, which confirms that manually revised training data remains crucial for the quality of the generated CNs. Furthermore, multilingual data augmentation improves results over monolingual settings for structurally similar languages such as English and Spanish, but is detrimental for Basque, a language isolate. Zero-shot crosslingual evaluations yield similar findings: model transfer (fine-tuning in English and generating in a different target language) outperforms fine-tuning mT5 on machine-translated data for Spanish but not for Basque. This offers insight into the asymmetry in the multilinguality of generative models, a challenging topic that remains open to research.
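The training settings contrasted above (post-edited data, silver MT data, multilingual augmentation, and zero-shot model transfer from English) can be read as different selections from one parallel corpus. The following is a minimal sketch of that selection only, not the authors' code: the `Pair` record, its `origin` labels, the setting names, and `select_training_pairs` are all hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One HS/CN training pair (hypothetical record layout)."""
    lang: str               # "en", "es" or "eu"
    origin: str             # "gold" = original or post-edited, "mt" = silver MT
    hate_speech: str
    counter_narrative: str


def select_training_pairs(corpus, target_lang, setting):
    """Return the pairs used for fine-tuning under one assumed setting.

    The setting names are illustrative labels, not terms from the paper.
    """
    if setting == "post_edited":
        # Monolingual: manually revised data in the target language.
        def keep(p):
            return p.lang == target_lang and p.origin == "gold"
    elif setting == "silver_mt":
        # Monolingual: raw machine-translated data in the target language.
        def keep(p):
            return p.lang == target_lang and p.origin == "mt"
    elif setting == "multilingual":
        # Data augmentation: gold data from every language in the corpus.
        def keep(p):
            return p.origin == "gold"
    elif setting == "zero_shot":
        # Model transfer: train on English only, generate in target_lang.
        def keep(p):
            return p.lang == "en" and p.origin == "gold"
    else:
        raise ValueError(f"unknown setting: {setting!r}")
    return [p for p in corpus if keep(p)]
```

In this sketch the zero-shot setting ignores `target_lang` at training time by design: the target language only matters when generating and evaluating.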

