Recent advances in large language models (LLMs) have spurred interest in using them to generate robot programs from natural language, with promising initial results. We investigate the use of LLMs to generate programs for service mobile robots that leverage mobility, perception, and human-interaction skills, where accurate sequencing and ordering of actions is crucial for success. We contribute CodeBotler, an open-source, robot-agnostic tool for programming service mobile robots from natural language, and RoboEval, a benchmark for evaluating the ability of LLMs to generate programs that complete service robot tasks. CodeBotler generates programs via few-shot prompting of LLMs with an embedded domain-specific language (eDSL) in Python, and leverages skill abstractions to deploy generated programs on any general-purpose mobile robot. RoboEval evaluates the correctness of generated programs by executing them from multiple initial states and checking whether the resulting execution traces satisfy temporal logic properties that encode correctness for each task. RoboEval also includes multiple prompts per task to test the robustness of program generation. We evaluate several popular state-of-the-art LLMs on the RoboEval benchmark and perform a thorough analysis of their failure modes, resulting in a taxonomy that highlights common pitfalls of LLMs in generating robot programs. We release our code and benchmark at https://amrl.cs.utexas.edu/codebotler/.
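The two ideas in the abstract, a Python eDSL backed by skill abstractions and trace-based checking from multiple initial states, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `SimRobot`, its skill methods (`go_to`, `is_in_room`, `say`), the hard-coded `generated_program`, and the `eventually_in_order` checker are hypothetical names, not the actual CodeBotler or RoboEval API, and the ordered-subsequence check is only a simplified stand-in for a temporal logic property.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Event = Tuple[str, str]  # (skill name, argument)


@dataclass
class SimRobot:
    """Hypothetical simulated backend: every skill call is logged to a trace."""
    location: str
    objects: Dict[str, Set[str]]  # room -> objects present in that room
    trace: List[Event] = field(default_factory=list)

    def go_to(self, room: str) -> None:
        self.location = room
        self.trace.append(("go_to", room))

    def is_in_room(self, obj: str) -> bool:
        self.trace.append(("is_in_room", obj))
        return obj in self.objects.get(self.location, set())

    def say(self, message: str) -> None:
        self.trace.append(("say", message))


def generated_program(robot: SimRobot) -> None:
    """Stand-in for an LLM-generated program (hard-coded here) for the task:
    'Check whether there is a stapler in the printer room, then come back
    and tell me.'"""
    start = robot.location
    robot.go_to("printer room")
    found = robot.is_in_room("stapler")
    robot.go_to(start)
    robot.say("There is a stapler." if found else "There is no stapler.")


def eventually_in_order(
    trace: List[Event], predicates: List[Callable[[Event], bool]]
) -> bool:
    """Return True if the trace contains events matching each predicate,
    in the given order (other events may occur in between)."""
    idx = 0
    for event in trace:
        if idx < len(predicates) and predicates[idx](event):
            idx += 1
    return idx == len(predicates)


def check_task(initial_objects: Dict[str, Set[str]], expected_phrase: str) -> bool:
    """Run the program from one initial state and check its trace."""
    robot = SimRobot(location="office", objects=initial_objects)
    generated_program(robot)
    return eventually_in_order(robot.trace, [
        lambda e: e == ("go_to", "printer room"),
        lambda e: e == ("is_in_room", "stapler"),
        lambda e: e == ("go_to", "office"),
        lambda e: e[0] == "say" and expected_phrase in e[1],
    ])


# Evaluate the same program from two different initial world states.
results = [
    check_task({"printer room": {"stapler"}}, "is a stapler"),
    check_task({"printer room": set()}, "no stapler"),
]
print(results)
```

Because the program only calls skill methods, the same code could in principle run against a real robot backend that implements the same interface. Running it from more than one initial state matters: a program that always reports "There is a stapler." would pass the first check and fail the second.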