Can Vision-Language Models Think from a First-Person Perspective?

Vision-language models (VLMs) have recently shown promising results on traditional downstream tasks. Evaluation studies have emerged to assess their abilities, but most focus on the third-person perspective, and only a few address specific tasks from the first-person perspective. The capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, therefore remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that covers six core capabilities across twelve detailed dimensions. The benchmark is constructed from selected clips of egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Because the answers are open-ended, we use GPT-4 as an automatic judge that performs single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still have considerable room for improvement on first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink is a valuable addition to existing evaluation benchmarks for VLMs and a resource for future research in embodied artificial intelligence and robotics.

