Long-horizon execution in Large Language Models (LLMs) remains unstable even when high-level strategies are provided. Evaluating on controlled algorithmic puzzles, we demonstrate that while decomposition is essential for stability, extreme decomposition creates a "no-recovery bottleneck". We show that this bottleneck becomes critical because errors are distributed highly non-uniformly: consistent errors on a few "hard" steps become irreversible. To address this, we propose Lookahead-Enhanced Atomic Decomposition (LEAD). By incorporating short-horizon future validation and aggregating overlapping rollouts, LEAD provides enough isolation to maintain stability while retaining enough local context to correct errors. This enables the o4-mini model to solve Checkers Jumping up to complexity $n=13$, whereas extreme decomposition fails beyond $n=11$.
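To make the idea of aggregating overlapping rollouts concrete, the following is a minimal toy sketch, not the LEAD implementation. It assumes a hypothetical ground-truth move sequence `TARGET`, a hypothetical `propose_rollout` function standing in for a model call, and a single hard step `HARD_STEP` on which the first move of a rollout is always wrong. Majority voting is an assumed aggregation rule, and validation against the puzzle rules is omitted.

```python
from collections import Counter

# Hypothetical ground-truth move sequence (toy assumption, not a real puzzle).
TARGET = list("abcdefgh")
# Toy assumption: a rollout that STARTS at this step always gets its first move wrong.
HARD_STEP = 4


def propose_rollout(start, horizon):
    """Toy stand-in for a model call: predict moves for steps start..start+horizon-1."""
    moves = TARGET[start:start + horizon]  # slicing copies, so TARGET is not mutated
    if start == HARD_STEP and moves:
        moves[0] = "?"  # consistent error on the hard step
    return moves


def run(horizon):
    """Execute step by step; each move is the majority vote over all rollouts covering it."""
    votes = [Counter() for _ in TARGET]
    executed = []
    for t in range(len(TARGET)):
        # The rollout started at step t also votes on the next horizon-1 steps.
        for offset, move in enumerate(propose_rollout(t, horizon)):
            votes[t + offset][move] += 1
        executed.append(votes[t].most_common(1)[0][0])
    return executed


atomic = run(horizon=1)     # extreme decomposition: one vote per step, no recovery
lookahead = run(horizon=3)  # overlapping rollouts: earlier lookaheads outvote the error
```

With `horizon=1` the erroneous move at `HARD_STEP` is executed because it is the only vote. With `horizon=3` the two rollouts started at the preceding steps also cover `HARD_STEP` and outvote it, so in this toy setting the executed sequence matches `TARGET`.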