Many researchers have reached the conclusion that AI models should be trained to be aware of the possibility of variation and disagreement in human judgments, and evaluated according to their ability to recognize such variation. The LEWIDI series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models, by making suitable datasets more accessible and by developing evaluation methods. The third edition of the task builds on this goal by extending the LEWIDI benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference, with labeling schemes that include not only categorical judgments, as in previous editions, but ordinal judgments as well. Another novelty is that we adopt two complementary paradigms to evaluate disagreement-aware systems: the soft-label approach, in which models predict population-level distributions of judgments, and the perspectivist approach, in which models predict the interpretations of individual annotators. Crucially, we moved beyond standard metrics such as cross-entropy and tested new evaluation metrics for the two paradigms. The task attracted diverse participation, and the results provide insights into the strengths and limitations of methods for modeling variation. Together, these contributions strengthen LEWIDI as a framework and provide new resources, benchmarks, and findings to support the development of disagreement-aware technologies.