The correct model response in the face of uncertainty is to abstain from answering a question so as not to mislead the user. In this work, we study the ability of LLMs to abstain from answering context-dependent science questions when provided insufficient or incorrect context. We probe model sensitivity in several settings: removing gold context, replacing gold context with irrelevant context, and providing additional irrelevant context alongside the gold context. In experiments on four QA datasets with four LLMs, we show that performance varies greatly across models, the type of context provided, and question type; in particular, many LLMs seem unable to abstain from answering boolean questions when given standard QA prompts. Our analysis also highlights the unexpected impact of abstention behavior on QA task accuracy. Counterintuitively, in some settings, replacing gold context with irrelevant context, or adding irrelevant context to gold context, can improve abstention performance in a way that also improves task performance. Our results suggest that changes are needed in QA dataset design and evaluation to more effectively assess the correctness and downstream impacts of model abstention.
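To make the three context settings concrete, the sketch below shows one way such perturbed prompts might be constructed. The dataset fields, the source of irrelevant passages, the abstention keyword, and the prompt wording are illustrative assumptions, not the exact setup used in the experiments.

```python
# Hypothetical sketch of the three context-perturbation settings described above.
# All field names, the distractor pool, and the prompt template are assumptions.
import random
from dataclasses import dataclass

@dataclass
class QAExample:
    question: str
    gold_context: str   # passage that actually supports the answer
    answer: str

def build_prompt(question: str, context: str | None) -> str:
    """Standard QA prompt; the model may answer or abstain (e.g. reply 'unanswerable')."""
    ctx_block = f"Context: {context}\n" if context else ""
    return (
        f"{ctx_block}Question: {question}\n"
        "Answer the question using only the context above. "
        "If the context does not contain the answer, reply 'unanswerable'.\nAnswer:"
    )

def perturb(example: QAExample, setting: str, distractor_pool: list[str]) -> str:
    """Produce a prompt under one of the three context settings (or the gold baseline)."""
    irrelevant = random.choice(distractor_pool)   # e.g. gold context from an unrelated question
    if setting == "no_context":                   # remove gold context entirely
        return build_prompt(example.question, None)
    if setting == "irrelevant_only":              # replace gold context with irrelevant context
        return build_prompt(example.question, irrelevant)
    if setting == "gold_plus_irrelevant":         # add irrelevant context on top of gold
        return build_prompt(example.question, example.gold_context + "\n" + irrelevant)
    return build_prompt(example.question, example.gold_context)  # unperturbed baseline
```

Under this kind of setup, abstention would be scored by checking whether the model produces the designated abstention string when the prompt lacks sufficient context, and task accuracy by comparing non-abstaining answers against the gold answer.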