Vision-language models (VLMs) have shown strong capabilities in visual question answering and reasoning tasks by combining visual representations with the abstract skills that large language models (LLMs) learn during pretraining. Vision, while the most popular modality with which to augment LLMs, is only one representation of a scene. In human-robot interaction scenarios, robots need accurate scene understanding. In this paper, we define and demonstrate a method for aligning the embedding spaces of other modalities (here, inertial measurement unit (IMU) data) to the vision embedding space through a combination of supervised and contrastive training, enabling the VLM to understand and reason about these additional modalities without retraining. We give the model IMU embeddings directly, rather than feeding the output of a separate human activity recognition model into the prompt, in order to preserve any nonlinear interactions between the query, image, and IMU signal that would be lost by mapping the IMU data to a discrete activity label. We demonstrate the method's efficacy through experiments on human activity recognition using IMU data and visual inputs. Our results show that using multiple modalities as input improves the VLM's scene understanding and its overall performance across tasks, paving the way for more versatile and capable language models in multimodal contexts.
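To make the idea of combining supervised and contrastive training concrete, the following is a minimal NumPy sketch of one plausible objective, not the exact loss used in this work. It assumes a symmetric InfoNCE-style contrastive term between paired IMU and vision embeddings plus a cross-entropy term on activity labels. The function names, the weighting parameter `alpha`, and the `temperature` value are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def contrastive_loss(imu_emb, vis_emb, temperature=0.07):
    # Assumed InfoNCE-style loss: row i of imu_emb is paired with row i of
    # vis_emb; every other row in the batch serves as a negative.
    imu = l2_normalize(imu_emb)
    vis = l2_normalize(vis_emb)
    logits = imu @ vis.T / temperature
    idx = np.arange(len(logits))
    loss_imu_to_vis = -log_softmax(logits)[idx, idx].mean()
    loss_vis_to_imu = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_imu_to_vis + loss_vis_to_imu)

def supervised_loss(class_logits, labels):
    # Cross-entropy between predicted activity logits and integer labels.
    idx = np.arange(len(labels))
    return -log_softmax(class_logits)[idx, labels].mean()

def alignment_loss(imu_emb, vis_emb, class_logits, labels,
                   alpha=0.5, temperature=0.07):
    # Hypothetical weighting: alpha trades off the supervised term against
    # the contrastive term.
    return (alpha * supervised_loss(class_logits, labels)
            + (1.0 - alpha) * contrastive_loss(imu_emb, vis_emb, temperature))
```

In this sketch the contrastive term pulls each IMU embedding toward the vision embedding of the same moment, while the supervised term keeps the IMU embeddings discriminative for activity labels. In practice the encoders would be trained with an automatic differentiation framework; only the loss computation is shown here.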