Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, but most focus on the third-person perspective, and only a few address specific tasks from the first-person perspective. The capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, therefore remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed from selected clips of egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, because the answers are open-ended, we use GPT-4 as an automatic judge to perform single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still leave considerable room for improvement on first-person perspective tasks. Meanwhile, increasing the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink is a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in embodied artificial intelligence and robotics.
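To make the single-answer grading step concrete, the following is a minimal sketch rather than the benchmark's actual evaluation code. It assumes a judge that scores each open-ended answer against the annotated reference and that scores are averaged per dimension. The prompt template, the 0 / 0.5 / 1 score scale, the record fields, the dimension labels, and the `stub_judge` function are all illustrative assumptions; in practice `judge` would wrap a call to a model such as GPT-4.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable

# Hypothetical judge prompt; the wording and score scale are assumptions.
JUDGE_TEMPLATE = (
    "You are grading an answer to a first-person visual question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {answer}\n"
    "Reply with a single score: 0 (wrong), 0.5 (partially correct), or 1 (correct)."
)


def parse_score(reply: str) -> float:
    """Extract the first standalone 0, 0.5, or 1 from the judge's reply."""
    match = re.search(r"(?<![\d.])(0\.5|1(?:\.0)?|0(?:\.0)?)(?!\.?\d)", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return float(match.group(1))


def grade(records: Iterable[Dict[str, str]],
          judge: Callable[[str], str]) -> Dict[str, float]:
    """Grade each answer on its own, then average the scores per dimension."""
    per_dimension = defaultdict(list)
    for record in records:
        prompt = JUDGE_TEMPLATE.format(
            question=record["question"],
            reference=record["reference"],
            answer=record["answer"],
        )
        per_dimension[record["dimension"]].append(parse_score(judge(prompt)))
    return {dim: mean(scores) for dim, scores in per_dimension.items()}


def stub_judge(prompt: str) -> str:
    """Stand-in for a real judge model: exact match scores 1, anything else 0."""
    fields = dict(
        line.split(": ", 1) for line in prompt.splitlines() if ": " in line
    )
    return "1" if fields["Reference answer"] == fields["Model answer"] else "0"


# Made-up records with hypothetical dimension labels, for illustration only.
records = [
    {"dimension": "object", "question": "What am I holding?",
     "reference": "a cup", "answer": "a cup"},
    {"dimension": "object", "question": "What is on my left?",
     "reference": "a lamp", "answer": "a chair"},
    {"dimension": "planning", "question": "How do I open the door?",
     "reference": "turn the handle", "answer": "turn the handle"},
]

scores = grade(records, stub_judge)
print(scores)
```

Passing the judge in as a callable keeps the aggregation logic independent of any particular model API, so the same loop can be exercised offline with a stub before a real judge is plugged in.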