Recent advances in large language models (LLMs) have spurred interest in using them to generate robot programs from natural language, with promising initial results. We investigate the use of LLMs to generate programs for service mobile robots. Such programs draw on mobility, perception, and human-interaction skills, and the correct sequencing of actions is crucial to their success. We contribute CodeBotler, an open-source, robot-agnostic tool for programming service mobile robots from natural language, and RoboEval, a benchmark for evaluating the ability of LLMs to generate programs that complete service robot tasks. CodeBotler generates programs via few-shot prompting of LLMs with an embedded domain-specific language (eDSL) in Python, and leverages skill abstractions to deploy the generated programs on any general-purpose mobile robot. RoboEval evaluates the correctness of a generated program by executing it from multiple initial states and checking whether the resulting execution traces satisfy temporal logic properties that encode correctness for each task. RoboEval also includes multiple prompts per task to test the robustness of program generation. We evaluate several popular state-of-the-art LLMs on the RoboEval benchmark and perform a thorough analysis of their failure modes, resulting in a taxonomy that highlights common pitfalls of LLMs in generating robot programs. We release our code and benchmark at https://amrl.cs.utexas.edu/codebotler/.
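To make the evaluation idea concrete, the following is a minimal sketch of trace-based checking, not the actual CodeBotler or RoboEval implementation. It assumes a toy simulated robot whose skill names (`go_to`, `is_in_room`, `say`), the task ("check whether there is a mug in the kitchen and report back"), the helper `eventually_then`, and the initial states are all hypothetical and chosen only for illustration. The program is run from several initial states, and each recorded trace is checked against a simple ordering property of the form "eventually A, and later B".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Event = Tuple[str, str]  # (skill name, argument); hypothetical trace format


@dataclass
class SimRobot:
    """Toy simulated robot that records every skill call as a trace event."""
    location: str
    objects: Dict[str, Set[str]]  # room -> objects present in that room
    trace: List[Event] = field(default_factory=list)

    def go_to(self, room: str) -> None:
        self.location = room
        self.trace.append(("go_to", room))

    def is_in_room(self, obj: str) -> bool:
        self.trace.append(("is_in_room", obj))
        return obj in self.objects.get(self.location, set())

    def say(self, message: str) -> None:
        self.trace.append(("say", message))


def generated_program(robot: SimRobot) -> None:
    """Stand-in for an LLM-generated program (hypothetical task)."""
    start = robot.location
    robot.go_to("kitchen")
    found = robot.is_in_room("mug")
    robot.go_to(start)
    robot.say("There is a mug" if found else "There is no mug")


def eventually_then(trace: List[Event],
                    first: Callable[[Event], bool],
                    second: Callable[[Event], bool]) -> bool:
    """True if an event matching `first` is strictly followed by one matching `second`."""
    for i, event in enumerate(trace):
        if first(event):
            # The earliest match leaves the longest suffix, so one check suffices.
            return any(second(later) for later in trace[i + 1:])
    return False


# Each initial state pairs a world with the report expected in that world.
INITIAL_STATES = [
    ({"kitchen": {"mug"}}, "There is a mug"),
    ({"kitchen": set()}, "There is no mug"),
]


def check_task(program: Callable[[SimRobot], None], initial_states) -> bool:
    """Run the program from every initial state and check the ordering property."""
    for objects, expected in initial_states:
        robot = SimRobot(location="office", objects=objects)
        program(robot)
        ok = eventually_then(
            robot.trace,
            lambda e: e == ("go_to", "kitchen"),
            lambda e, expected=expected: e == ("say", expected),
        )
        if not ok:
            return False
    return True


if __name__ == "__main__":
    print(check_task(generated_program, INITIAL_STATES))
```

Checking against more than one initial state matters here: a program that always says "There is a mug" would pass in the first world but fail in the second, which a single-state test would miss.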