Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation. We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with a slow, deliberative Small Language Model (SLM) planner. PACT invokes the SLM asynchronously to generate and validate candidate action plans. Once a plan is verified through simulation as safe, feasible, and complete, it is executed directly, bypassing the RL policy without retraining or modifying it. Evaluated on three FrozenLake configurations of increasing difficulty, PACT outperforms all baselines while relying on a 2B-parameter SLM backbone, suggesting that deliberative planning and reactive execution are more powerful in concert than either is alone in these settings.
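The validate-then-commit step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a deterministic (non-slippery) FrozenLake-style grid, treats a move off the grid as infeasible, and falls back to the RL policy whenever no validated plan is available. The names `validate_plan`, `act`, `GRID`, and `rl_policy` are hypothetical, and the asynchronous SLM call is omitted; the candidate plan is simply passed in as a list of actions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Action encoding follows the FrozenLake convention: 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

# Hypothetical 4x4 map: S=start, F=frozen (safe), H=hole, G=goal.
GRID = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]

State = Tuple[int, int]


def validate_plan(grid: Sequence[str], start: State, plan: Sequence[int]) -> bool:
    """Simulate a candidate plan and accept it only if it is
    safe (never enters a hole), feasible (every action is known and stays
    on the grid; an assumption of this sketch), and complete (reaches the goal)."""
    rows, cols = len(grid), len(grid[0])
    r, c = start
    for action in plan:
        if action not in MOVES:
            return False  # infeasible: unknown action
        dr, dc = MOVES[action]
        r, c = r + dr, c + dc
        if not (0 <= r < rows and 0 <= c < cols):
            return False  # infeasible: leaves the grid
        if grid[r][c] == "H":
            return False  # unsafe: falls into a hole
        if grid[r][c] == "G":
            return True  # complete: goal reached
    return False  # incomplete: plan ends before the goal


def act(
    grid: Sequence[str],
    state: State,
    candidate_plan: Optional[Sequence[int]],
    rl_policy: Callable[[State], int],
) -> List[int]:
    """Return the actions to execute from `state`.

    A validated plan is committed to and executed directly, bypassing the
    RL policy. Otherwise the unmodified RL policy supplies a single
    reactive action (an assumed fallback rule)."""
    if candidate_plan is not None and validate_plan(grid, state, candidate_plan):
        return list(candidate_plan)
    return [rl_policy(state)]


if __name__ == "__main__":
    always_right = lambda state: 2  # stand-in for a trained RL policy
    good_plan = [1, 1, 2, 1, 2, 2]  # reaches G without touching a hole
    bad_plan = [2, 1]               # steps into the hole at row 1, column 1
    print(act(GRID, (0, 0), good_plan, always_right))
    print(act(GRID, (0, 0), bad_plan, always_right))
```

Because the RL policy is only ever called as a black-box function, the sketch mirrors the claim that the policy needs no retraining or modification; the planner and validator sit entirely outside it.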