LLM-Flax: Generalizable Robotic Task Planning via Neuro-Symbolic Approaches with Large Language Models

Deploying a neuro-symbolic task planner on a new domain today requires significant manual effort: a domain expert must author relaxation rules and complementary rules, and hundreds of training problems must be solved to supervise a Graph Neural Network (GNN) object scorer. We propose LLM-Flax, a three-stage framework that eliminates all three sources of manual effort using a locally hosted LLM given only a PDDL domain file. Stage 1 automatically generates relaxation and complementary rules via structured prompting with format validation and self-correction. Stage 2 introduces LLM-guided failure recovery with a feasibility-gated budget policy that explicitly reserves the API latency cost before each LLM call, preventing the downstream relaxation fallback from being starved. Stage 3 replaces the domain-trained GNN entirely with zero-shot LLM object importance scoring, requiring no training data. We evaluate all three stages on the MazeNamo benchmark across 10x10, 12x12, and 15x15 grids (eight benchmarks in total). LLM-Flax achieves an average success rate (SR) of 0.945 versus the manual baseline's 0.828 (+0.117), matching or outperforming manual rules on every one of the eight benchmarks. On 12x12 Expert, LLM-Flax attains SR 0.733 where the manual planner fails entirely (SR 0.000); on 15x15 Hard, it achieves SR 1.000 versus the manual baseline's 0.900. Stage 3 demonstrates feasibility (SR 0.720 on 12x12 Hard with no training data) but faces a context-window bottleneck at scale, pointing to the primary open challenge for future work.
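
The feasibility gate in Stage 2 can be illustrated with a minimal sketch. The abstract does not specify the policy's exact form, so everything below is an assumption made for illustration: the names `Budget`, `can_call_llm`, and `recover`, the use of a fixed latency estimate, and the fixed fallback reserve are all hypothetical. The idea shown is only that the estimated LLM latency is charged against the time budget before the call, and the call is skipped when paying it would leave too little time for the relaxation fallback.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Budget:
    """Hypothetical wall-clock planning budget, in seconds."""
    total_s: float
    spent_s: float = 0.0

    @property
    def remaining_s(self) -> float:
        return max(0.0, self.total_s - self.spent_s)


def can_call_llm(budget: Budget, est_llm_latency_s: float,
                 fallback_reserve_s: float) -> bool:
    """Allow an LLM call only if, after paying its estimated latency,
    at least `fallback_reserve_s` seconds remain for the fallback."""
    return budget.remaining_s - est_llm_latency_s >= fallback_reserve_s


def recover(budget: Budget, est_llm_latency_s: float, fallback_reserve_s: float,
            llm_repair: Callable[[], Optional[object]],
            relaxation_fallback: Callable[[float], object]) -> Tuple[str, object]:
    """Try LLM-guided recovery if the gate permits; otherwise (or if the
    LLM yields nothing) run the relaxation fallback with the time left."""
    if can_call_llm(budget, est_llm_latency_s, fallback_reserve_s):
        # Reserve the latency cost up front (assumed fixed estimate).
        budget.spent_s += est_llm_latency_s
        plan = llm_repair()
        if plan is not None:
            return "llm", plan
    return "fallback", relaxation_fallback(budget.remaining_s)
```

Under these assumptions, a 30 s budget with 10 s already spent, an 8 s latency estimate, and a 10 s reserve admits the LLM call (12 s would remain); with 15 s already spent the gate refuses it and the fallback runs with the full remaining 15 s.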
