There is growing interest in applying large language models (LLMs) to robotic tasks, owing to their remarkable reasoning ability and the extensive knowledge they learn from vast training corpora. However, grounding LLMs in the physical world remains an open challenge because they can only process textual input. Recent advances in large vision-language models (LVLMs) have enabled a more comprehensive understanding of the physical world by incorporating visual input, which provides richer contextual information than language alone. In this work, we proposed a novel paradigm that leveraged GPT-4V(ision), the state-of-the-art LVLM by OpenAI, to enable embodied agents to perceive liquid objects via image-based environmental feedback. Specifically, we exploited the physical understanding of GPT-4V to interpret the visual representation (e.g., a time-series plot) of non-visual feedback (e.g., F/T sensor data), indirectly enabling multimodal perception beyond vision and language by using images as proxies. We evaluated our method on 10 common household liquids in containers of various geometries and materials. Without any training or fine-tuning, we demonstrated that our method could enable the robot to indirectly perceive the physical response of liquids and estimate their viscosity. We also showed that, by jointly reasoning over the visual and physical attributes learned through interactions, our method could recognize liquid objects in the absence of strong visual cues (e.g., container labels with legible text or symbols), increasing the accuracy from 69.0% (achieved by the best-performing vision-only variant) to 86.0%.
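To make the "images as proxies" idea concrete, the following is a minimal sketch of how a non-visual signal might be turned into an image that a vision-language model can read. It is not the authors' implementation: the function names (`synthetic_force`, `render_series_png`, `to_data_url`), the image size, and the damped-oscillation signal are all assumptions made for illustration, and the signal is synthetic rather than real F/T sensor data. The sketch uses only the Python standard library and stops at producing a base64 data URL; how that image is then passed to GPT-4V, and with what prompt, is outside its scope.

```python
import base64
import math
import struct
import zlib


def synthetic_force(damping, n=200, dt=0.01, freq_hz=3.0):
    """Hypothetical stand-in for one F/T channel: a damped sine wave.

    This is synthetic data for illustration only, not a liquid model.
    """
    return [
        math.exp(-damping * k * dt) * math.sin(2 * math.pi * freq_hz * k * dt)
        for k in range(n)
    ]


def render_series_png(values, width=320, height=160, margin=10):
    """Rasterize a 1-D series as a black polyline on a white 8-bit grayscale PNG."""
    if len(values) < 2:
        raise ValueError("need at least two samples")
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat signal
    rows = [bytearray([255] * width) for _ in range(height)]  # 255 = white

    def to_pixel(i, v):
        # Map sample index to x and sample value to y (larger values drawn higher).
        x = margin + i * (width - 2 * margin - 1) / (len(values) - 1)
        y = margin + (hi - v) * (height - 2 * margin - 1) / span
        return x, y

    # Draw each segment by dense linear interpolation between neighbouring samples.
    for i in range(len(values) - 1):
        x0, y0 = to_pixel(i, values[i])
        x1, y1 = to_pixel(i + 1, values[i + 1])
        steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        for s in range(steps + 1):
            t = s / steps
            rows[round(y0 + t * (y1 - y0))][round(x0 + t * (x1 - x0))] = 0

    def chunk(tag, data):
        # PNG chunk layout: length, tag, data, CRC over tag + data.
        body = tag + data
        return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))

    # IHDR fields: width, height, bit depth 8, color type 0 (grayscale),
    # compression 0, filter 0, interlace 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in rows)
    return (
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", ihdr)
        + chunk(b"IDAT", zlib.compress(raw))
        + chunk(b"IEND", b"")
    )


def to_data_url(png_bytes):
    """Encode PNG bytes as a base64 data URL, a common way to attach images to a request."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")


if __name__ == "__main__":
    # Two synthetic responses with different (assumed) damping rates.
    for damping in (1.0, 6.0):
        url = to_data_url(render_series_png(synthetic_force(damping)))
        print(f"damping={damping}: data URL of {len(url)} characters")
```

In practice a plotting library would add axes, units, and labels, which likely matter for a model asked to read the plot; the hand-rolled rasterizer here only keeps the sketch dependency-free.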