This work pioneers the evaluation of emergent planning capabilities grounded in situational awareness in large language models. We contribute (i) novel benchmarks and metrics for standardized assessment; (ii) a unique dataset to spur progress; and (iii) demonstrations that prompting and multi-agent schemes significantly enhance performance on context-sensitive planning tasks. Positioning this work within situated-agent and automated-planning research, we highlight an inherent reliability challenge: efficiently mapping world states to actions without environmental guidance remains an open problem, despite advances in simulated domains. Although addressing them is out of scope here, limitations in validation methodology and data availability point to promising directions, including fine-tuning on expanded planning corpora and optimizations for triggering fast latent planning. By demonstrating both the promise and the limitations of current methods through rigorous comparison, we aim to catalyze research into reliable goal-directed reasoning for situated agents.