Recent studies suggest that self-reflective prompting can significantly enhance the reasoning capabilities of Large Language Models (LLMs). However, the use of external feedback as a stop criterion raises doubts about the true extent of LLMs' ability to emulate human-like self-reflection. In this paper, we set out to clarify these capabilities under a more stringent evaluation setting in which we disallow any kind of external feedback. Our findings under this setting show a split: while self-reflection enhances performance on TruthfulQA, it adversely affects results on HotpotQA. Our follow-up analyses indicate that the effect of self-reflection is shaped both by the reliability of models' initial responses and by overall question difficulty: specifically, self-reflection is most beneficial when models are unlikely to be correct initially, and when questions are harder overall. We also find that self-reflection reduces models' tendency toward majority voting. Based on these findings, we propose guidelines for deciding when to apply self-reflection. We release the codebase for reproducing our experiments at https://github.com/yanhong-lbh/LLM-SelfReflection-Eval.