The application scope of large language models (LLMs) is steadily expanding. In practical use, users may give feedback on a model's output, expecting the model to revise its responses accordingly. Whether models can respond appropriately to users' refuting feedback and consistently follow it through has not been thoroughly analyzed. In light of this, this paper proposes a comprehensive benchmark, RefuteBench, covering tasks such as question answering, machine translation, and email writing. The evaluation assesses whether models can positively accept feedback in the form of refuting instructions and whether they can consistently adhere to user demands throughout the conversation. We conduct evaluations on numerous LLMs and find that LLMs are stubborn, i.e., they exhibit an inclination toward their internal knowledge, often failing to comply with user feedback. Additionally, as the length of the conversation increases, models gradually forget the user's stated feedback and roll back to their own responses. We further propose recall-and-repeat prompts as a simple and effective way to enhance the model's responsiveness to feedback.