Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering

Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for information-seeking tasks such as question answering (QA). By simply prepending retrieved documents to their input along with an instruction, these models can be adapted to various information domains and tasks without additional fine-tuning. While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics such as exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we investigate the performance of instruction-following models across three information-seeking QA tasks. We use both automatic and human evaluation to assess these models along two dimensions: 1) how well they satisfy the user's information need (correctness), and 2) whether they produce a response based on the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness. We then propose simple token-overlap-based and model-based metrics that reflect the true performance of these models. Our analysis reveals that instruction-following models are competitive with, and sometimes even outperform, fine-tuned models for correctness. However, these models struggle to stick to the provided knowledge and often hallucinate in their responses. We hope our work encourages a more holistic evaluation of instruction-following models for QA. Our code and data are available at https://github.com/McGill-NLP/instruct-qa
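
To make the point about verbosity concrete, the following is a minimal sketch of how EM and F1 can penalize a correct but wordy response. It is not taken from the authors' repository: the SQuAD-style normalization, the function names, and the use of token-level recall as an example of a token-overlap metric are all assumptions made for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization, chosen only for illustration.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def overlap(response: str, reference: str) -> int:
    """Number of tokens shared by response and reference (with multiplicity)."""
    common = Counter(normalize(response)) & Counter(normalize(reference))
    return sum(common.values())


def exact_match(response: str, reference: str) -> float:
    """1.0 only if the normalized response equals the normalized reference."""
    return float(normalize(response) == normalize(reference))


def f1(response: str, reference: str) -> float:
    """Harmonic mean of token precision and recall."""
    n = overlap(response, reference)
    if n == 0:
        return 0.0
    precision = n / len(normalize(response))
    recall_ = n / len(normalize(reference))
    return 2 * precision * recall_ / (precision + recall_)


def recall(response: str, reference: str) -> float:
    """Fraction of reference tokens that appear in the response.

    Assumption: used here as one possible token-overlap metric that
    does not penalize extra words in the response.
    """
    ref_tokens = normalize(reference)
    if not ref_tokens:
        return 0.0
    return overlap(response, reference) / len(ref_tokens)


# Hypothetical example: a correct but verbose answer.
reference = "Paris"
response = "The capital of France is Paris."

print(exact_match(response, reference))   # 0.0
print(round(f1(response, reference), 3))  # 0.333
print(recall(response, reference))        # 1.0
```

In this hypothetical case the response contains the reference answer, yet EM is 0 and F1 is dragged down by the four extra tokens, while recall is unaffected by response length. Recall alone can also be satisfied by a response that merely mentions the answer among wrong content, which is one reason to pair token-overlap metrics with human or model-based evaluation.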
