Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for information-seeking tasks such as question answering (QA). By simply prepending retrieved documents to their input along with an instruction, these models can be adapted to various information domains and tasks without additional fine-tuning. While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics such as exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we investigate the performance of instruction-following models across three information-seeking QA tasks. We use both automatic and human evaluation to assess these models along two dimensions: 1) how well they satisfy the user's information need (correctness), and 2) whether they produce a response based on the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness. We then propose simple token-overlap-based and model-based metrics that reflect the true performance of these models. Our analysis reveals that instruction-following models are competitive with, and sometimes even outperform, fine-tuned models for correctness. However, these models struggle to stick to the provided knowledge and often hallucinate in their responses. We hope our work encourages a more holistic evaluation of instruction-following models for QA. Our code and data are available at https://github.com/McGill-NLP/instruct-qa
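To make the verbosity problem concrete, the following minimal sketch computes EM and token-level F1 for a short answer and a fluent, sentence-length answer to the same question. The SQuAD-style normalization, the function names, and the example strings are illustrative assumptions, not the code from the linked repository. The `recall` function is included only as one example of a token-overlap metric that does not penalize extra tokens; the abstract does not specify which token-overlap metric is proposed.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization, chosen here for illustration.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def _overlap(prediction: str, reference: str) -> tuple[int, int, int]:
    """Return (shared token count, prediction length, reference length)."""
    pred, ref = normalize(prediction), normalize(reference)
    common = sum((Counter(pred) & Counter(ref)).values())
    return common, len(pred), len(ref)


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and token recall."""
    common, n_pred, n_ref = _overlap(prediction, reference)
    if common == 0:
        return 0.0
    precision = common / n_pred
    rec = common / n_ref
    return 2 * precision * rec / (precision + rec)


def recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens found in the prediction (illustrative)."""
    common, _, n_ref = _overlap(prediction, reference)
    return common / n_ref if n_ref else 0.0


# Hypothetical example: both responses contain the correct answer.
reference = "Paris"
short_answer = "Paris"
verbose_answer = "The capital of France is Paris."

for name, response in [("short", short_answer), ("verbose", verbose_answer)]:
    print(
        name,
        exact_match(response, reference),
        round(f1(response, reference), 3),
        recall(response, reference),
    )
```

In this example the verbose response is correct, yet it receives an EM of 0 and an F1 of about 0.33, because the extra tokens lower precision. A metric that ignores precision, such as the recall shown here, scores both responses equally.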