Large Language Models (LLMs) have recently shown strong promise for robotic task planning, particularly through the automatic generation of planning domains. Such domains, however, are brittle under imperfect logical states and perception noise, and prior approaches largely treat generated domains as one-off utilities for producing plans, overlooking their potential as scalable sources of reasoning supervision and structured reward signals. At the same time, reasoning LLMs depend on chain-of-thought (CoT) supervision that is expensive to collect for robotic tasks, and reinforcement learning (RL) is hampered by costly manual reward engineering. We propose Self-CriTeach, a self-teaching and self-critiquing framework in which an LLM autonomously generates symbolic planning domains that serve a dual role. First, the self-written domains enable large-scale generation of robotic planning problem-plan pairs, whose symbolic task plans are automatically transformed into extended CoT trajectories for supervised fine-tuning. Second, the same domains are reused as structured reward functions, providing dense feedback for reinforcement learning without manual reward engineering. This unified training pipeline yields a planning-enhanced LLM with higher planning success rates, stronger cross-task generalization, lower inference cost, and improved robustness to imperfect logical states.
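To make the dual role concrete, the sketch below shows the reward half of the idea: a symbolic domain is used to simulate a candidate plan and emit dense feedback without any hand-written reward. This is a minimal, self-contained illustration, not the paper's implementation; the toy `pick`/`place` actions, the predicate names, and the partial-credit reward shaping are all assumptions made here for clarity (Self-CriTeach's actual domains would be symbolic planning domains, e.g. in PDDL, rather than Python dictionaries).

```python
# Sketch (illustrative assumptions, not Self-CriTeach's actual format):
# a self-written symbolic domain doubling as a dense reward function.

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # predicates that must hold before execution
    add_effects: frozenset    # predicates made true by the action
    del_effects: frozenset    # predicates made false by the action

# A toy "self-written" domain for a pick-and-place task.
DOMAIN = {
    "pick(block)": Action(
        "pick(block)",
        preconditions=frozenset({"on_table(block)", "hand_empty"}),
        add_effects=frozenset({"holding(block)"}),
        del_effects=frozenset({"on_table(block)", "hand_empty"}),
    ),
    "place(block, shelf)": Action(
        "place(block, shelf)",
        preconditions=frozenset({"holding(block)"}),
        add_effects=frozenset({"on(block, shelf)", "hand_empty"}),
        del_effects=frozenset({"holding(block)"}),
    ),
}

def dense_reward(plan, init_state, goal):
    """Simulate a candidate plan against the domain and return a dense
    reward: partial credit for each applicable action, plus a bonus if
    the goal predicates hold in the final state."""
    state = set(init_state)
    applicable = 0
    for step in plan:
        action = DOMAIN.get(step)
        if action is None or not action.preconditions <= state:
            break  # invalid step: stop simulation, keep partial credit
        state -= action.del_effects
        state |= action.add_effects
        applicable += 1
    step_reward = applicable / max(len(plan), 1)
    goal_bonus = 1.0 if goal <= state else 0.0
    return step_reward + goal_bonus

# Score an LLM-proposed plan: all steps applicable and goal reached.
plan = ["pick(block)", "place(block, shelf)"]
init = {"on_table(block)", "hand_empty"}
goal = {"on(block, shelf)"}
print(dense_reward(plan, init, goal))  # 2.0
```

Because the simulator grants partial credit for every applicable prefix of the plan, the signal is dense rather than binary, which is what makes domain reuse attractive as an RL reward in place of manual reward engineering.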