Test-time compute strategies such as Chain-of-Thought (CoT) have significantly improved the ability of large language models (LLMs) to solve complex tasks like logical reasoning. However, empirical studies indicate that simply increasing the compute budget can sometimes cause test-time performance to collapse under typical task decomposition strategies such as CoT. This work hypothesizes that reasoning failures at larger compute budgets stem from static planning methods, which struggle to perceive the intrinsic boundaries of LLM reasoning. We term this the Limited Reasoning Space hypothesis and analyze it theoretically through the lens of a non-autonomous stochastic dynamical system. The analysis suggests that there is an optimal range for the compute budget: over-planning can produce redundant feedback and may even impair reasoning. To exploit the benefits of compute scaling while suppressing over-planning, this work proposes Halo, a model predictive control framework for LLM planning. Halo is designed for long-horizon tasks with reason-based planning and uses an entropy-driven dual controller that adopts a Measure-then-Plan strategy to achieve controllable reasoning. Experimental results show that Halo outperforms static baselines on complex long-horizon tasks by dynamically regulating planning at the reasoning boundary.
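To make the Measure-then-Plan idea concrete, the following is a minimal sketch of entropy-gated planning. It is not Halo's actual controller, which the abstract does not specify. It assumes that each reasoning step exposes a discrete probability distribution over candidate next actions, and the names `shannon_entropy`, `measure_then_plan`, `replan_threshold`, and `max_replans` are hypothetical, introduced only for illustration.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def measure_then_plan(
    step_distributions: Sequence[Sequence[float]],
    replan_threshold: float = 1.0,  # hypothetical entropy threshold, in nats
    max_replans: int = 2,           # hypothetical cap on extra planning
) -> List[str]:
    """Return one action per reasoning step: 'replan' or 'proceed'.

    Illustrative rule (an assumption, not the paper's method): measure the
    uncertainty of each step first, and spend extra planning only when the
    entropy exceeds the threshold and planning budget remains.
    """
    actions: List[str] = []
    replans_used = 0
    for probs in step_distributions:
        entropy = shannon_entropy(probs)  # measure before planning
        if entropy > replan_threshold and replans_used < max_replans:
            actions.append("replan")
            replans_used += 1
        else:
            actions.append("proceed")
    return actions
```

Under these assumptions, a confident step (low entropy) proceeds without extra planning, while an uncertain step triggers replanning until the cap is reached. The cap is one simple way to reflect the claim that planning beyond an optimal budget stops helping.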