In this work, we take a close look at one of the popular knowledge-grounded dialogue benchmarks that focus on faithfulness, FaithDial. We show that a significant portion of the FaithDial data contains annotation artifacts, which may bias models towards completely ignoring the conversation history. We therefore introduce CHARP, a diagnostic test set designed for an improved evaluation of hallucinations in conversational models. CHARP measures not only hallucination but also the models' compliance with the conversation task. Our extensive analysis reveals that models perform poorly on CHARP primarily because they fail to effectively attend to and reason over the conversation history. Furthermore, the evaluation methods of FaithDial fail to capture these shortcomings, as they likewise neglect the conversation history. Our findings indicate that there is substantial room for contribution in both dataset creation and hallucination evaluation for knowledge-grounded dialogue, and that CHARP can serve as a tool for monitoring progress in this research area. CHARP is publicly available at https://huggingface.co/datasets/huawei-noah/CHARP.