Human-robot interaction is an exciting task that aims to guide robots to follow human instructions. Because a large gap separates human natural language from machine code, building end-to-end human-robot interaction models is quite challenging. Furthermore, the visual information received from a robot's sensors is itself a difficult "language" for the robot to perceive. In this work, we propose HuBo-VLM, a unified transformer-based vision-language model, to tackle perception tasks associated with human-robot interaction, including object detection and visual grounding. Extensive experiments on the Talk2Car benchmark demonstrate the effectiveness of our approach. Code will be made publicly available at https://github.com/dzcgaara/HuBo-VLM.