Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This progress has enabled strong results in Zero-Shot Egocentric Action Recognition (ZS-EAR). Typically, VLMs treat ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment between visual and linguistic knowledge. We propose a refined approach to ZS-EAR with VLMs that emphasizes fine-grained concept-description alignment and capitalizes on the rich semantic and contextual details in egocentric videos. Specifically, we introduce GPT4Ego, a straightforward yet highly effective VLM framework for ZS-EAR, designed to enhance the fine-grained alignment of concepts and descriptions between vision and language. Extensive experiments demonstrate that GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks: EPIC-KITCHENS-100 (33.2%, +9.4%), EGTEA (39.6%, +5.5%), and CharadesEgo (31.5%, +2.6%).
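To make the "global video-text matching" baseline mentioned above concrete, the following is a minimal sketch of zero-shot classification by cosine similarity between one video-level embedding and one text embedding per action label. It is not the GPT4Ego method, whose details the abstract does not give. The function names (`cosine`, `zero_shot_predict`), the labels, and the three-dimensional embeddings are hypothetical and hand-written for illustration; a real system would obtain embeddings from a VLM's video and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(video_emb, text_embs):
    """Score every label against the single global video embedding
    and return the best-scoring label together with all scores."""
    scores = {label: cosine(video_emb, emb) for label, emb in text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy embeddings (assumption: stand-ins for VLM encoder outputs).
text_embs = {
    "cut onion": [0.9, 0.1, 0.0],
    "wash pan": [0.1, 0.9, 0.1],
    "open fridge": [0.0, 0.2, 0.9],
}
video_emb = [0.8, 0.3, 0.1]

label, scores = zero_shot_predict(video_emb, text_embs)
print(label)
```

Because the whole clip is reduced to one vector and each action to one label embedding, this kind of matching has no explicit place for the objects, hands, or scene context that a fine-grained concept-description alignment would aim to exploit.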