One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the whole trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay walks the plan through the environment up to its first invalid transition, and a single LLM call then resumes from the verified prefix. RePoT adds at most one LLM call, and only on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium. Against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini. We begin to address this capability-scaling pattern with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary results). We replicate the gains on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback. This indicates that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
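The verified-replay step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ReplayResult`, `verified_replay`, `choose_recovery`, the `step` callback convention (return `None` for an invalid action), and the `min_prefix_fraction` threshold are all hypothetical. Goal checking and the actual LLM repair call are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List, Optional, Sequence

State = Hashable
Action = str
# Assumed convention: step(state, action) returns the next state,
# or None when the action is invalid in that state.
StepFn = Callable[[State, Action], Optional[State]]


@dataclass
class ReplayResult:
    verified_prefix: List[Action]    # actions that executed validly, in order
    checkpoint: State                # state reached after the verified prefix
    failed_action: Optional[Action]  # first invalid action, or None if none failed

    @property
    def ok(self) -> bool:
        # True when every action in the plan was a valid transition.
        return self.failed_action is None


def verified_replay(initial: State, plan: Sequence[Action], step: StepFn) -> ReplayResult:
    """Deterministically walk the plan and stop at the first invalid transition."""
    state = initial
    prefix: List[Action] = []
    for action in plan:
        nxt = step(state, action)
        if nxt is None:
            return ReplayResult(prefix, state, action)
        state = nxt
        prefix.append(action)
    return ReplayResult(prefix, state, None)


def choose_recovery(result: ReplayResult, plan_len: int,
                    min_prefix_fraction: float = 0.5) -> str:
    """Hypothetical rule-based dispatcher keyed on verified-prefix length.

    The 0.5 threshold is an illustrative assumption, not a value from the text.
    """
    if result.ok:
        return "accept"
    if plan_len == 0:
        return "retry"
    fraction = len(result.verified_prefix) / plan_len
    return "repair_suffix" if fraction >= min_prefix_fraction else "retry"


def corridor_step(pos: int, action: str) -> Optional[int]:
    """Toy environment: a corridor with cells 0..3 and actions 'left'/'right'."""
    delta = {"left": -1, "right": 1}.get(action)
    if delta is None:
        return None
    new_pos = pos + delta
    return new_pos if 0 <= new_pos <= 3 else None


if __name__ == "__main__":
    plan = ["right", "right", "right", "right", "left"]
    result = verified_replay(0, plan, corridor_step)
    print(result.verified_prefix, result.checkpoint, result.failed_action)
    print(choose_recovery(result, len(plan)))
```

In this toy run the fourth `right` would leave the corridor, so the replay stops with a three-action verified prefix and a checkpoint at cell 3. In a setup like the one described, a repair prompt would presumably carry that checkpoint and prefix to the single follow-up LLM call.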