When robots perform long action sequences, users will want to easily and reliably find out what they have done. We therefore demonstrate the task of learning to summarize and answer questions about a robot agent's past actions using natural language alone. A single system with a large language model at its core is trained to both summarize and answer questions about action sequences, given egocentric video frames of a virtual robot and a question prompt. To enable training of question answering, we develop a method to automatically generate English-language questions and answers about objects, actions, and the temporal order in which actions occurred during episodes of robot action in the virtual environment. Training one model to both summarize and answer questions enables zero-shot transfer: representations of objects learned through question answering improve action summarization.