This paper presents the results of SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate. Such cases of overgeneration jeopardize many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs, each labeled by 5 annotators, spanning 3 NLP tasks: machine translation, paraphrase generation, and definition modeling. The shared task was tackled by a total of 58 different users grouped in 42 teams, 27 of which elected to write a system description paper; collectively, they submitted over 300 prediction sets across both tracks of the shared task. We observe a number of key trends in how the task was tackled: many participants rely on a handful of models, and often rely either on fine-tuning with synthetic data or on zero-shot prompting strategies. While a majority of the teams outperformed our proposed baseline system, the performance of even the top-scoring systems is still consistent with random handling of the more challenging items.