This paper aims to develop a framework that enables a robot to execute tasks based on visual information in response to natural language instructions for Fetch-and-Carry with Object Grounding (FCOG) tasks. Although many frameworks exist, they usually rely on manually provided instruction sentences, so evaluations have been conducted only on fixed tasks. Furthermore, many multimodal language understanding models for these benchmarks consider only discrete actions. To address these limitations, we propose a framework that fully automates the generation, execution, and evaluation of FCOG tasks. In addition, we introduce an approach that solves FCOG tasks by dividing them into four distinct subtasks.