This work pioneers the evaluation of emergent planning capabilities grounded in situational awareness in large language models. We contribute (i) novel benchmarks and metrics for standardized assessment; (ii) a unique dataset to spur progress; and (iii) demonstrations that prompting and multi-agent schemes significantly enhance performance on context-sensitive planning tasks. Positioning this work within situated-agent and automated-planning research, we highlight an inherent reliability challenge: efficiently mapping world states to actions without environmental guidance remains an open problem, despite advances in simulated domains. Although addressing them is beyond the scope of this work, limitations in validation methodology and data availability point to promising directions, including fine-tuning on expanded planning corpora and optimizations for triggering fast latent planning. By demonstrating both the promise and the limitations of current methods through rigorous comparison, we hope to catalyze research on reliable goal-directed reasoning for situated agents.