Large Language Models (LLMs) have shown strong promise for robotic task planning, particularly through the automatic generation of symbolic planning domains. However, prior work treats generated domains mainly as planning utilities. Such pipelines remain brittle under imperfect logical states and perception noise, and they overlook the potential of generated domains as scalable sources of reasoning supervision and structured reward signals. At the same time, reasoning LLMs depend on chain-of-thought (CoT) supervision, which is expensive to collect for robotic tasks, and reinforcement learning (RL) requires difficult reward engineering. We propose Self-CriTeach, an LLM self-teaching and self-critiquing framework in which an LLM autonomously generates symbolic planning domains that serve a dual role: (1) in the self-teaching stage, the generated domains are used to produce large-scale robotic planning problem-plan pairs, which are automatically converted into extended CoT trajectories for supervised fine-tuning; (2) in the self-critiquing stage, the same domains are reused as structured reward functions, providing dense feedback for reinforcement learning without manual reward engineering. This unified training pipeline yields a planning-enhanced LLM with higher planning success rates, stronger cross-task generalization, reduced inference cost, and greater robustness to imperfect logical states. GitHub Page: https://markli1hoshipu.github.io/Plan_LLM/
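To make the self-critiquing idea concrete, the following is a minimal sketch of how a symbolic domain could double as a dense reward function. It is an illustration under stated assumptions, not the Self-CriTeach implementation: the STRIPS-style `Action` representation, the `plan_reward` function, the toy pick-and-place domain, and the 0.5/0.5 weighting between step validity and goal satisfaction are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical STRIPS-style action: preconditions plus add/delete effects."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def plan_reward(actions, init, goal, plan):
    """Score a candidate plan in [0, 1] by simulating it in the symbolic domain.

    Half of the reward is the fraction of plan steps executed before the
    first invalid step; the other half is granted only if the goal holds in
    the state reached. The weighting is an assumption for illustration.
    """
    state = set(init)
    valid_steps = 0
    for name in plan:
        action = actions.get(name)
        # Stop at the first unknown action or unmet precondition.
        if action is None or not action.preconditions <= state:
            break
        state = (state - action.delete_effects) | action.add_effects
        valid_steps += 1
    step_score = valid_steps / len(plan) if plan else 0.0
    goal_score = 1.0 if set(goal) <= state else 0.0
    return 0.5 * step_score + 0.5 * goal_score


# Toy domain (assumed for the example): move one block from table to shelf.
ACTIONS = {
    "pick": Action(
        "pick",
        frozenset({"hand_empty", "block_on_table"}),
        frozenset({"holding_block"}),
        frozenset({"hand_empty", "block_on_table"}),
    ),
    "place": Action(
        "place",
        frozenset({"holding_block"}),
        frozenset({"block_on_shelf", "hand_empty"}),
        frozenset({"holding_block"}),
    ),
}
INIT = {"hand_empty", "block_on_table"}
GOAL = {"block_on_shelf"}

if __name__ == "__main__":
    print(plan_reward(ACTIONS, INIT, GOAL, ["pick", "place"]))  # valid plan
    print(plan_reward(ACTIONS, INIT, GOAL, ["pick", "pick"]))   # fails at step 2
    print(plan_reward(ACTIONS, INIT, GOAL, ["place", "pick"]))  # fails at step 1
```

Because a partially valid plan earns partial credit, a reward of this shape is denser than a binary success signal, which is the property the abstract attributes to reusing generated domains as reward functions.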