Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering

Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for information-seeking tasks such as question answering (QA). By simply prepending retrieved documents to their input along with an instruction, these models can be adapted to various information domains and tasks without additional fine-tuning. While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics such as exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we investigate the performance of instruction-following models across three information-seeking QA tasks. We use both automatic and human evaluation to evaluate these models along two dimensions: 1) how well they satisfy the user's information need (correctness), and 2) whether they produce a response based on the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness. We then propose simple token-overlap-based and model-based metrics that reflect the true performance of these models. Our analysis reveals that instruction-following models are competitive with, and sometimes even outperform, fine-tuned models for correctness. However, these models struggle to stick to the provided knowledge and often hallucinate in their responses. We hope our work encourages a more holistic evaluation of instruction-following models for QA. Our code and data are available at https://github.com/McGill-NLP/instruct-qa
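
To illustrate why verbosity hurts exact match (EM) and F1, the following is a minimal sketch that scores a short reference answer against a fluent, sentence-length response. The normalization rules, the function names, and the example question are assumptions made for illustration and are not taken from the authors' repository. Token-level recall is shown only as one possible token-overlap alternative; the abstract does not specify which token-overlap metric is proposed.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    This mirrors the kind of normalization commonly used in QA scoring;
    the exact rules here are an assumption for this sketch.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 only if the normalized token sequences are identical."""
    return float(normalize(prediction) == normalize(reference))


def token_overlap(prediction: str, reference: str) -> tuple[float, float, float]:
    """Return token-level (precision, recall, F1) between prediction and reference."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0, 0.0, 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical example: the response is correct but verbose.
    reference = "Paris"
    response = "The capital of France is Paris, which lies on the Seine."

    precision, recall, f1 = token_overlap(response, reference)
    print(f"EM={exact_match(response, reference):.2f}")
    print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this example the response contains the correct answer, yet EM is zero and F1 is low because the extra tokens reduce precision; recall is unaffected by the added words. Recall alone would also reward responses that list many candidate answers, so it is best read alongside human or model-based judgments.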

