RePoT: Recoverable Program-of-Thought via Checkpoint Repair

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the whole trajectory. We introduce RePoT (Recoverable PoT), which has two stages: a deterministic verified replay that walks the plan through the environment up to its first invalid transition, followed by a single LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call, incurred only on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium. Against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini. We begin to address this capability-scaling pattern with Adaptive RePoT (preliminary), a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length. We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback. This shows that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
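The two stages described above (verified replay up to the first invalid transition, then one resumption call from the verified prefix) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `verified_replay`, `repot`, `corridor_step`, and `stub_resume` are hypothetical, the environment is a toy one-dimensional corridor, and `stub_resume` is a hand-written stand-in for the single LLM call. The sketch also assumes that a plan which is fully valid but misses the goal is resumed from its final state, which the abstract does not specify.

```python
from typing import Callable, List, Optional, Tuple

State = int
Action = str
# Hypothetical interface: step(state, action) returns the next state,
# or None when the action is invalid in that state.
StepFn = Callable[[State, Action], Optional[State]]


def verified_replay(
    initial: State, plan: List[Action], step: StepFn
) -> Tuple[List[Action], State, Optional[int]]:
    """Deterministically replay `plan` and stop at the first invalid transition.

    Returns (verified_prefix, checkpoint_state, index_of_first_invalid_action).
    The index is None when every action in the plan is valid.
    """
    state = initial
    for i, action in enumerate(plan):
        nxt = step(state, action)
        if nxt is None:
            return plan[:i], state, i
        state = nxt
    return list(plan), state, None


def repot(
    initial: State,
    plan: List[Action],
    step: StepFn,
    is_goal: Callable[[State], bool],
    resume: Callable[[State, List[Action]], List[Action]],
) -> Tuple[List[Action], int]:
    """Return (final_plan, extra_calls); `resume` is invoked at most once."""
    prefix, checkpoint, bad = verified_replay(initial, plan, step)
    if bad is None and is_goal(checkpoint):
        return prefix, 0  # the original plan already succeeds
    # Assumption: `resume` stands in for one LLM call that continues
    # from the checkpoint state reached by the verified prefix.
    suffix = resume(checkpoint, prefix)
    return prefix + suffix, 1


# Toy environment (assumption): positions 0..GOAL, actions "L" and "R".
GOAL = 4


def corridor_step(pos: State, action: Action) -> Optional[State]:
    delta = {"L": -1, "R": 1}.get(action)
    if delta is None:
        return None  # unknown action
    nxt = pos + delta
    return nxt if 0 <= nxt <= GOAL else None  # walking off either end is invalid


def stub_resume(checkpoint: State, prefix: List[Action]) -> List[Action]:
    # Stand-in for the LLM: walk right from the checkpoint to the goal.
    return ["R"] * (GOAL - checkpoint)


if __name__ == "__main__":
    broken_plan = list("RRLLLR")  # the fifth action steps off the left end
    final_plan, extra_calls = repot(
        0, broken_plan, corridor_step, lambda s: s == GOAL, stub_resume
    )
    print(final_plan, extra_calls)
```

In this toy run the replay verifies the first four actions, returns to position 0 as the checkpoint, and the stub supplies the remaining suffix. A real system would presumably replay the repaired plan once more before accepting it, since the resumed suffix can itself contain invalid actions.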

