As Large Language Models (LLMs) achieve remarkable performance across various NLP tasks, their reliability becomes essential for widespread adoption. This paper focuses on Abstention Ability (AA), a critical yet underexplored aspect of reliability: the ability of LLMs to refrain from answering questions when they are uncertain or when a definitive answer is not possible, while maintaining question-answering (QA) task performance. While previous works have focused on understanding the recollection abilities of LLMs or their ability to identify imponderable/unanswerable questions, we believe there is a need for an effective AA evaluation method. Therefore, we propose a black-box evaluation methodology to examine and understand the AA of LLMs across a variety of multiple-choice QA tasks. We measure AA by rewarding models for abstaining from answering when their predictions are incorrect or when the questions are inherently unanswerable. We investigate three strategies, namely Strict Prompting, Verbal Confidence Thresholding, and Chain-of-Thought (CoT) prompting, to understand their impact on abstention across different LLMs. Our findings reveal that while even state-of-the-art LLMs like GPT-4 struggle with abstention, strategic prompting, such as CoT, can significantly enhance this ability. Furthermore, we demonstrate that improving AA also leads to better overall QA task performance, underscoring the importance of evaluating AA in LLMs.
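To make the scoring idea concrete, below is a minimal Python sketch of one plausible abstention-aware metric for multiple-choice QA. The `ABSTAIN` sentinel, the function names, and the exact reward values (+1 for a correct answer, +1 for abstaining on an unanswerable question, 0 otherwise) are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of an abstention-aware scoring scheme for multiple-choice QA.
# Assumed scheme: +1 for a correct answer, +1 for abstaining when the question
# is inherently unanswerable, 0 otherwise (including wrong answers and
# abstentions on answerable questions). The paper's reward may differ.

ABSTAIN = "ABSTAIN"  # hypothetical sentinel for an "I don't know" response


def abstention_aware_score(prediction: str, gold: str | None) -> int:
    """Score one item. `gold` is None for inherently unanswerable questions."""
    if gold is None:  # unanswerable: abstaining is the only correct behavior
        return 1 if prediction == ABSTAIN else 0
    if prediction == ABSTAIN:  # abstained on an answerable question:
        return 0               # no reward, but also no wrong-answer penalty
    return 1 if prediction == gold else 0


def abstention_ability(predictions: list[str], golds: list[str | None]) -> float:
    """Mean abstention-aware score over a dataset."""
    scores = [abstention_aware_score(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)


# Example: two answerable items and one unanswerable item.
preds = ["B", ABSTAIN, ABSTAIN]
golds = ["B", "C", None]
print(abstention_ability(preds, golds))  # 2/3 ≈ 0.67
```

Under this scheme, a model is never rewarded for answering incorrectly, so abstaining weakly dominates guessing whenever the model is likely wrong, which captures the incentive structure the evaluation is meant to measure.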