The recent development of chain-of-thought (CoT) decoding has enabled large language models (LLMs) to generate explicit logical reasoning paths for complex problem-solving. However, research indicates that these paths are not always deliberate and optimal. The tree-of-thought (ToT) method employs tree search to extensively explore the reasoning space and find better reasoning paths that CoT decoding might overlook. This deliberation, however, comes at the cost of significantly increased inference complexity. In this work, we demonstrate that fine-tuning LLMs using the search tree constructed by ToT allows CoT decoding to achieve similar or better performance, thereby avoiding the substantial inference burden. This is achieved through Chain of Preference Optimization (CPO), where LLMs are fine-tuned to align each step of their CoT reasoning paths with those of ToT using the preference information inherent in the tree-search process. Extensive experimental results show that CPO significantly improves LLM performance in solving a variety of complex problems, including question answering, fact verification, and arithmetic reasoning, demonstrating its effectiveness. Our code is available at https://github.com/sail-sg/CPO.
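The core idea can be sketched as follows: at each node of the ToT search tree, the step the search keeps becomes a "preferred" example and its pruned siblings become "dispreferred" ones, and these step-level pairs feed a DPO-style objective. This is a minimal illustrative sketch, not the paper's actual implementation; the tree representation, the `build_preference_pairs` helper, and the simplified per-pair loss are all assumptions for exposition.

```python
import math

def build_preference_pairs(tree):
    """Collect (prefix, chosen_step, rejected_step) triples from a toy search tree.

    `tree` maps a reasoning prefix (a tuple of earlier steps) to the step the
    tree search selected at that node and the sibling candidates it pruned.
    This mirrors, in simplified form, how CPO-style preference data could be
    harvested from a ToT search without extra annotation.
    """
    pairs = []
    for prefix, node in tree.items():
        for rejected in node["pruned"]:
            pairs.append((prefix, node["chosen"], rejected))
    return pairs

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Simplified DPO objective for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy tree: at the root the search kept step "s1" and pruned "s1'";
# after "s1" it kept "s2" and pruned two alternatives.
tree = {
    (): {"chosen": "s1", "pruned": ["s1'"]},
    ("s1",): {"chosen": "s2", "pruned": ["s2a", "s2b"]},
}
pairs = build_preference_pairs(tree)
print(len(pairs))  # 3 preference pairs harvested from the search tree

# The loss falls below log 2 when the policy prefers the chosen step
# more strongly than the reference model does (positive margin).
loss = dpo_pair_loss(-1.0, -2.0, -1.5, -1.5)
print(loss < math.log(2))
```

The key point the sketch illustrates is that the preference labels come for free from the pruning decisions of the tree search, so the fine-tuned model can later follow the preferred steps with plain CoT decoding, without running the search at inference time.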