Chain-of-thought (COT) prompting can help large language models (LLMs) reason toward correct answers, but its efficacy in reasoning toward incorrect answers is unexplored. This strategy of process of elimination (PoE), when used with COT, has the potential to enhance interpretability in tasks like medical diagnoses of exclusion. Thus, we propose PoE with COT, a new task where LLMs must reason toward incorrect options on multiple-choice questions. We evaluate the ability of GPT-3.5, LLaMA-2, and Falcon to perform PoE with COT on two-choice commonsense and scientific reasoning datasets. We show that PoE consistently underperforms the strategy of directly choosing the correct answer. Agreement between the two strategies is also lower than the self-consistency of each strategy. To study these issues further, we conduct an error analysis and give suggestions for future work.
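To make the compared quantities concrete, the following is a minimal sketch of how accuracy, cross-strategy agreement, and self-consistency could be computed on two-choice questions. It assumes that eliminating one option implies choosing the other, that agreement is the fraction of questions on which the two strategies arrive at the same option, and that self-consistency is the same fraction measured across two runs of a single strategy. These definitions, the function names, and the toy values are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def implied_answer(eliminated: int) -> int:
    """On a two-choice question, eliminating one option implies the other."""
    if eliminated not in (0, 1):
        raise ValueError("eliminated must be 0 or 1")
    return 1 - eliminated


def match_rate(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions where two equal-length answer lists coincide."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy values invented for illustration only (not results from the paper).
gold = [0, 1, 1, 0]            # index of the correct option per question
direct_run1 = [0, 1, 1, 1]     # options chosen directly, first run
direct_run2 = [0, 1, 1, 0]     # options chosen directly, second run
poe_elim_run1 = [1, 0, 1, 0]   # options eliminated by PoE, first run
poe_elim_run2 = [1, 0, 0, 0]   # options eliminated by PoE, second run

# Convert eliminated options into the answers they imply.
poe_run1 = [implied_answer(e) for e in poe_elim_run1]
poe_run2 = [implied_answer(e) for e in poe_elim_run2]

direct_accuracy = match_rate(direct_run1, gold)
poe_accuracy = match_rate(poe_run1, gold)
cross_agreement = match_rate(direct_run1, poe_run1)
direct_self_consistency = match_rate(direct_run1, direct_run2)
poe_self_consistency = match_rate(poe_run1, poe_run2)

print(f"direct accuracy:         {direct_accuracy:.2f}")
print(f"PoE accuracy:            {poe_accuracy:.2f}")
print(f"direct vs. PoE:          {cross_agreement:.2f}")
print(f"direct self-consistency: {direct_self_consistency:.2f}")
print(f"PoE self-consistency:    {poe_self_consistency:.2f}")
```

Mapping each eliminated option to its implied answer puts both strategies in the same answer space, so a single `match_rate` helper can serve for accuracy, agreement, and self-consistency alike.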