Large language models (LLMs) struggle with complex, long-horizon reasoning due to instability caused by their frozen policy assumption. Current test-time scaling methods treat execution feedback merely as an external signal for filtering or rewriting trajectories, without internalizing it to improve the underlying reasoning strategy. Inspired by Popper's epistemology of "conjectures and refutations," we argue that intelligence requires real-time evolution of the model's policy through learning from failed attempts. We introduce Policy of Thoughts (PoT), a framework that recasts reasoning as a within-instance online optimization process. PoT first generates diverse candidate solutions via an efficient exploration mechanism, then uses Group Relative Policy Optimization (GRPO) to update a transient LoRA adapter based on execution feedback. This closed-loop design enables dynamic, instance-specific refinement of the model's reasoning priors. Experiments show that PoT dramatically boosts performance: a 4B model achieves 49.71% accuracy on LiveCodeBench, outperforming GPT-4o and DeepSeek-V3 despite being over 50× smaller.
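To make the within-instance optimization loop concrete, the following is a minimal, self-contained toy sketch of a group-relative policy update, not the paper's actual implementation: it replaces the LLM and its LoRA adapter with a categorical policy over a small set of hypothetical candidate strategies, and replaces execution feedback with a given reward vector (e.g., a stand-in for unit-test pass rates). All function and variable names here are illustrative assumptions.

```python
import math

def grpo_step(logits, rewards, lr=0.5):
    """One group-relative policy-gradient step on a toy categorical policy.

    Each index is a candidate 'reasoning strategy'; `rewards` is the
    execution feedback for the group of sampled candidates. Advantages
    are computed relative to the group (reward minus group mean, scaled
    by group std), mirroring the group-relative idea in GRPO.
    """
    n = len(rewards)
    mean_r = sum(rewards) / n
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / n) or 1.0
    advantages = [(r - mean_r) / std_r for r in rewards]

    # Softmax policy over candidate strategies (numerically stabilized).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    # REINFORCE-style gradient of sum_i A_i * log p_i w.r.t. each logit:
    # d log p_i / d logit_k = (1 if i == k else 0) - p_k.
    new_logits = []
    for k in range(len(logits)):
        grad = sum(a * ((1.0 if i == k else 0.0) - probs[k])
                   for i, a in enumerate(advantages))
        new_logits.append(logits[k] + lr * grad)
    return new_logits

# Within-instance loop: repeat update steps using feedback for one problem.
logits = [0.0, 0.0, 0.0]
rewards = [0.0, 1.0, 0.0]  # hypothetically, only candidate 1 passes the tests
for _ in range(10):
    logits = grpo_step(logits, rewards)
```

After a few steps the policy concentrates on the candidate with above-group-mean reward, which is the "internalizing feedback" behavior the abstract contrasts with filter-only test-time methods. In the real framework this update would modify transient LoRA weights of the 4B model rather than a bare logit vector, and would be discarded after the instance.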