Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

Vision-language models (VLMs) have shown strong capabilities in visual question answering and reasoning tasks by combining visual representations with the abstract skills that large language models (LLMs) learn during pretraining. Vision, while the most popular modality with which to augment LLMs, is only one representation of a scene. In human-robot interaction scenarios, the robot must understand the scene accurately. In this paper, we define and demonstrate a method for aligning the embedding spaces of other modalities (in this case, inertial measurement unit (IMU) data) with the vision embedding space through a combination of supervised and contrastive training, enabling the VLM to understand and reason about these additional modalities without retraining. We give the model IMU embeddings directly rather than using a separate human activity recognition model whose output is fed into the prompt. This preserves any nonlinear interactions between the query, image, and IMU signal that would be lost by mapping the IMU data to a discrete activity label. Further, we demonstrate the method's efficacy through human activity recognition experiments using IMU data and visual inputs. Our results show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance across tasks, paving the way for more versatile and capable language models in multi-modal contexts.
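
The abstract does not specify the loss functions used, so the following is only a minimal sketch of what a combined supervised and contrastive alignment objective could look like. It assumes a symmetric InfoNCE term that pulls each IMU embedding toward its paired image embedding, plus a cross-entropy term on activity labels. All function names, the temperature of 0.07, and the mixing coefficient `weight` are hypothetical choices for illustration, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_loss(imu_embs, img_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th IMU embedding should match the i-th image embedding."""
    imu = [l2_normalize(v) for v in imu_embs]
    img = [l2_normalize(v) for v in img_embs]
    n = len(imu)
    # sim[i][j] is the temperature-scaled cosine similarity of IMU i and image j.
    sim = [[sum(a * b for a, b in zip(u, w)) / temperature for w in img] for u in imu]
    imu_to_img = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    img_to_imu = -sum(
        log_softmax([sim[r][i] for r in range(n)])[i] for i in range(n)
    ) / n
    return 0.5 * (imu_to_img + img_to_imu)


def supervised_loss(class_logits, labels):
    """Mean cross-entropy of activity logits against integer class labels."""
    return -sum(log_softmax(z)[y] for z, y in zip(class_logits, labels)) / len(labels)


def alignment_loss(imu_embs, img_embs, class_logits, labels, weight=0.5):
    """Weighted sum of both terms; `weight` is a hypothetical mixing coefficient."""
    return (weight * contrastive_loss(imu_embs, img_embs)
            + (1.0 - weight) * supervised_loss(class_logits, labels))
```

In a setup like this, only the IMU encoder would receive gradients from `alignment_loss`, while the vision encoder and the language model stay frozen, which is one way to read "without retraining". Correctly paired batches should score a lower contrastive loss than mismatched ones.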

