NLP datasets annotated with human judgments are rife with disagreements between the judges. This is especially true for tasks that depend on subjective judgments, such as sentiment analysis or offensive language detection. For these tasks in particular, the NLP community has come to realize that 'reconciling' the different subjective interpretations is inappropriate. Many NLP researchers have therefore concluded that, rather than eliminating disagreements from annotated corpora, we should preserve them; indeed, some argue that corpora should aim to preserve all annotator judgments. But this approach to corpus creation for NLP has not yet been widely accepted. The objective of the LeWiDi series of shared tasks is to promote this approach to developing NLP models by providing a unified framework for training and evaluating with such datasets. We report on the second LeWiDi shared task, which differs from the first edition in three crucial respects: (i) it focuses entirely on NLP, whereas the first edition covered both NLP and computer vision tasks; (ii) it focuses on subjective tasks rather than covering different types of disagreement, since training with aggregated labels for subjective NLP tasks is a particularly obvious misrepresentation of the data; and (iii) for evaluation, we concentrate on soft approaches. This second edition of LeWiDi attracted a wide array of participants, resulting in 13 shared task submission papers.