Large language models (LLMs) perform strongly on medical question answering (medical QA), and chain-of-thought (CoT) prompting improves results further by eliciting explicit intermediate reasoning. Self-reflective (self-corrective) prompting, in which an LLM is asked to critique and revise its own reasoning, is widely claimed to improve reliability, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering. Using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks: MedQA, HeadQA, and PubMedQA. We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and that its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but limited or negative effects on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior than as a standalone solution for improving medical QA reliability.
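The three outcomes analyzed in this work (error correction, error persistence, and newly introduced errors) can be made concrete with a small tally over per-question prediction trajectories. The Python sketch below is illustrative only: the function names, the trajectory format (one list of answers per question, with the baseline CoT answer first and one entry per reflection step after it), and the toy data are all assumptions, not the authors' evaluation code or results.

```python
from collections import Counter


def classify_transition(initial_pred, final_pred, gold):
    """Label how an answer changed between the CoT baseline and the last reflection step."""
    was_right = initial_pred == gold
    is_right = final_pred == gold
    if was_right and is_right:
        return "stays_correct"
    if not was_right and is_right:
        return "error_corrected"
    if was_right and not is_right:
        return "error_introduced"
    return "error_persists"


def summarize_transitions(trajectories, golds):
    """Count transition labels over all questions (first vs. last prediction)."""
    return Counter(
        classify_transition(traj[0], traj[-1], gold)
        for traj, gold in zip(trajectories, golds)
    )


def accuracy_per_step(trajectories, golds):
    """Accuracy at each step; index 0 is the baseline CoT answer."""
    n_steps = len(trajectories[0])
    return [
        sum(traj[k] == gold for traj, gold in zip(trajectories, golds)) / len(golds)
        for k in range(n_steps)
    ]


# Hypothetical toy data: four questions, baseline answer plus two reflection steps.
toy_trajectories = [
    ["A", "A", "A"],  # correct throughout
    ["B", "C", "C"],  # wrong, then corrected
    ["D", "D", "B"],  # correct, then broken by reflection
    ["A", "B", "B"],  # wrong before and after
]
toy_golds = ["A", "C", "D", "C"]

print(summarize_transitions(toy_trajectories, toy_golds))
print(accuracy_per_step(toy_trajectories, toy_golds))
```

In this toy example, accuracy rises after the first reflection step and falls back after the second. Separating "error_corrected" from "error_introduced" exposes movement in both directions that a single end-of-loop accuracy figure would hide.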