Divergent thinking, the cognitive process of generating diverse solutions, is a hallmark of human creativity and problem-solving. For machines, sampling diverse solution trajectories in complex reasoning problems is crucial for robust outcomes, data augmentation, and enhanced model generalization. Large language models (LLMs) often struggle with generating high-quality, diverse reasoning. While supervised fine-tuning improves quality, it requires extensive supervision data to capture the full diversity of solutions. Alternatively, reinforcement-learning methods such as PPO aim to find a limited set of highest-reward solutions while neglecting solution diversity, akin to convergent thinking. To address these limitations, we propose Flow of Reasoning (FoR), an efficient LLM training approach that enables diverse reasoning with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow from an initial state to terminal states. This formulation allows us to adapt principled GFlowNet approaches to train the LLM as a policy that samples multiple reasoning paths with probabilities proportional to an unnormalized reward. Empirical results show that, with limited training data (e.g., 15 examples), FoR discovers diverse, high-quality solutions that greatly outperform current state-of-the-art methods across three tasks: embodied reasoning (BlocksWorld), math puzzle solving (Game24), and logical reasoning (PrOntoQA). Code is available at https://github.com/Yu-Fangxu/FoR.
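The reward-proportional sampling objective above can be illustrated with a minimal sketch. This toy example (not the paper's implementation; the path names and rewards are hypothetical) shows the target distribution p(x) = R(x)/Z that a GFlowNet-trained policy approximates, and a generic trajectory-balance residual evaluated at its optimum, where it vanishes:

```python
import math

# Toy illustration: a GFlowNet-style objective drives a policy to sample
# terminal states x with probability p(x) = R(x) / Z, where R is an
# unnormalized reward and Z = sum_x R(x).

def target_distribution(rewards):
    """Normalize unnormalized rewards into the target sampling distribution."""
    z = sum(rewards.values())
    return {x: r / z for x, r in rewards.items()}

def trajectory_balance_residual(log_z, log_pf, log_pb, log_reward):
    """Squared trajectory-balance residual for one trajectory:
    (log Z + sum log P_F - log R(x) - sum log P_B)^2.
    log_pf / log_pb are per-step forward/backward log-probabilities."""
    return (log_z + sum(log_pf) - log_reward - sum(log_pb)) ** 2

# Three hypothetical reasoning paths with unnormalized rewards:
rewards = {"path_a": 4.0, "path_b": 1.0, "path_c": 1.0}
p = target_distribution(rewards)
# A perfectly trained sampler visits path_a four times as often as path_b.

# At the optimum the residual is zero: with log Z = log(6) and a one-step
# trajectory whose forward log-prob equals log R(x) - log Z (deterministic
# backward policy, log P_B = 0), the terms cancel exactly.
log_z = math.log(sum(rewards.values()))
loss = trajectory_balance_residual(
    log_z=log_z,
    log_pf=[math.log(rewards["path_a"]) - log_z],
    log_pb=[0.0],
    log_reward=math.log(rewards["path_a"]),
)
```

In practice the forward policy is the LLM itself, and the residual is minimized over sampled reasoning trajectories rather than evaluated in closed form.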