This work is an early effort to evaluate emergent planning capabilities grounded in situational awareness in large language models. We contribute (i) novel benchmarks and metrics for standardized assessment; (ii) a unique dataset to spur progress; and (iii) demonstrations that prompting and multi-agent schemes significantly improve performance on context-sensitive planning tasks. Positioning this work within situated-agent and automated-planning research, we highlight inherent reliability challenges: efficiently mapping world states to actions without environmental guidance remains an open problem, despite advances in simulated domains. Although addressing them is beyond the scope of this work, limitations in validation methodology and data availability point to promising directions, including fine-tuning on expanded planning corpora and optimizations for triggering fast latent planning. By rigorously comparing current methods and showing both their promise and their limitations, we aim to catalyze research on reliable goal-directed reasoning for situated agents.