To improve the ability of large language models (LLMs) to tackle complex reasoning problems, chain-of-thought (CoT) methods guide LLMs to reason step by step, solving problems from simple to complex. State-of-the-art methods for generating such chains rely on interactive collaboration: the learner generates candidate intermediate thoughts, the LLM evaluates them, and the evaluations guide the generation of subsequent thoughts. However, a widespread yet understudied problem is that the LLM's evaluations are typically noisy and unreliable, potentially misleading the generation process when selecting promising intermediate thoughts. In this paper, motivated by Vapnik's principle, we use pairwise-comparison evaluation instead of point-wise scoring to search for promising intermediate thoughts under noisy feedback from the LLM. In each round, we randomly pair intermediate thoughts and directly prompt the LLM to select the more promising one from each pair, identifying the most promising thoughts through an iterative process. To further alleviate the noise in the comparisons, we incorporate techniques from ensemble learning and dueling bandits, yielding two variants of the algorithm. Experiments on three real-world tasks demonstrate the effectiveness of the proposed algorithms and verify the rationale of the pairwise-comparison mechanism.
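The iterative pairwise-comparison selection described above can be sketched as follows. This is a minimal illustration, not the paper's actual method: it assumes a hypothetical `compare(a, b)` callable standing in for the LLM prompt that judges which of two thoughts is more promising, simulates comparison noise with a fixed error probability, and uses a simple majority vote over repeated queries as a stand-in for the ensemble-style noise reduction.

```python
import random

def majority_compare(a, b, compare, k=5):
    """Query the noisy comparator k times and take a majority vote,
    an ensemble-style trick to reduce the effect of comparison noise."""
    wins = sum(compare(a, b) for _ in range(k))
    return wins * 2 > k

def tournament_select(thoughts, compare, k=5):
    """Knockout tournament over candidate thoughts: randomly pair the
    candidates, keep each pairwise winner, and repeat until one remains."""
    pool = list(thoughts)
    while len(pool) > 1:
        random.shuffle(pool)  # random pairing each round
        winners = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(a if majority_compare(a, b, compare, k) else b)
        if len(pool) % 2:  # an odd thought out advances with a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0]

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical setup: each thought has a hidden quality score, and the
    # simulated LLM comparator answers correctly with probability 0.9.
    scores = {"t%d" % i: i for i in range(8)}
    def noisy_compare(a, b, p=0.9):
        correct = scores[a] > scores[b]
        return correct if random.random() < p else not correct
    print(tournament_select(list(scores), noisy_compare, k=5))
```

With a noiseless comparator the highest-quality thought always wins its match and is therefore always selected; under noise, increasing `k` trades extra comparator queries for a lower chance of eliminating the best candidate early.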