Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for information-seeking tasks such as question answering (QA). By simply prepending retrieved documents to their input along with an instruction, these models can be adapted to various information domains and tasks without additional fine-tuning. While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics such as exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we investigate the performance of instruction-following models across three information-seeking QA tasks. We use both automatic and human evaluation to assess these models along two dimensions: 1) how well they satisfy the user's information need (correctness), and 2) whether they produce a response based on the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness. We then propose simple token-overlap-based and model-based metrics that reflect the true performance of these models. Our analysis reveals that instruction-following models are competitive with, and sometimes even outperform, fine-tuned models on correctness. However, these models struggle to stick to the provided knowledge and often hallucinate in their responses. We hope our work encourages a more holistic evaluation of instruction-following models for QA. Our code and data are available at https://github.com/McGill-NLP/instruct-qa.
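To illustrate why verbosity penalizes EM and F1, the following minimal sketch scores a fluent, correct response against a short reference answer. It assumes SQuAD-style answer normalization (lowercasing, removing punctuation and articles), and it uses token recall as one possible token-overlap metric; the abstract does not specify which overlap metric is proposed, so this choice is an assumption. The function names `normalize`, `exact_match`, `f1`, and `recall` are illustrative and are not taken from the released code.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(response: str, reference: str) -> float:
    """1.0 only if the normalized token sequences are identical."""
    return float(normalize(response) == normalize(reference))


def _overlap(response: str, reference: str):
    """Return (shared token count, response length, reference length)."""
    resp, ref = normalize(response), normalize(reference)
    common = sum((Counter(resp) & Counter(ref)).values())
    return common, len(resp), len(ref)


def f1(response: str, reference: str) -> float:
    """Harmonic mean of token precision and token recall."""
    common, n_resp, n_ref = _overlap(response, reference)
    if common == 0:
        return 0.0
    precision = common / n_resp
    rec = common / n_ref
    return 2 * precision * rec / (precision + rec)


def recall(response: str, reference: str) -> float:
    """Fraction of reference tokens that appear in the response."""
    common, _, n_ref = _overlap(response, reference)
    return common / n_ref if n_ref else 0.0


if __name__ == "__main__":
    reference = "Paris"
    response = "The capital of France is Paris."
    print("EM:    ", exact_match(response, reference))
    print("F1:    ", round(f1(response, reference), 3))
    print("Recall:", recall(response, reference))
```

In this example the response contains the correct answer, yet EM is 0 and F1 is low because precision is diluted by the extra tokens, while recall is unaffected by response length. Recall alone cannot detect an incorrect answer that merely mentions the reference string, which is one reason to pair it with human or model-based judgments.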