Current video-language models (VLMs) rely extensively on instance-level alignment between the video and language modalities, which has two major limitations: (1) visual reasoning does not follow the natural first-person perception of humans, resulting in limited interpretability of the reasoning process; and (2) the learned representations fail to capture the inherent fine-grained relationships between the two modalities. In this paper, we take inspiration from human perception and explore a compositional approach to egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism that explicitly assembles dynamically evolving scene entities over time and models their relationships to form the video representation. By exploiting this compositional structure, HENASY offers strong interpretability via visual grounding with free-form text queries. We further explore a suite of multi-grained contrastive losses to facilitate entity-centric understanding, comprising three alignment types: video-narration, noun-entity, and verb-entity alignments. Our method demonstrates strong interpretability in both quantitative and qualitative experiments, while maintaining competitive performance on five downstream tasks via zero-shot transfer or as a video/text representation, including video/text retrieval, action recognition, multiple-choice query, natural language query, and moments query.
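To make the entity-assembly idea mentioned above concrete, below is a minimal sketch of a slot-attention-style spatiotemporal token grouping step, assuming a standard PyTorch setup; the class name `EntityGrouping`, the dimensions, and the inverted-softmax choice are illustrative assumptions, not HENASY's actual architecture.

```python
# A minimal sketch of slot-attention-style spatiotemporal token grouping,
# assuming a PyTorch setup. Class name, dimensions, and the inverted
# softmax are illustrative assumptions, not HENASY's actual architecture.
import torch
import torch.nn as nn

class EntityGrouping(nn.Module):
    """Softly assigns video patch tokens to a fixed set of entity slots."""

    def __init__(self, dim: int = 256, num_entities: int = 8):
        super().__init__()
        # One learnable query per candidate scene entity (hypothetical).
        self.entity_queries = nn.Parameter(torch.randn(num_entities, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.scale = dim ** -0.5

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T*N, dim) -- patch tokens flattened over time and space,
        # so a slot can bind the same entity across frames as it evolves.
        batch = tokens.shape[0]
        q = self.to_q(self.entity_queries).expand(batch, -1, -1)  # (B, E, dim)
        k, v = self.to_kv(tokens).chunk(2, dim=-1)                # (B, T*N, dim) each
        attn = torch.einsum('bed,bnd->ben', q, k) * self.scale
        # Softmax over the entity axis: each token competes to join one slot.
        attn = attn.softmax(dim=1)
        # Renormalize per slot over tokens before aggregating token values.
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.einsum('ben,bnd->bed', attn, v)              # (B, E, dim)

# Example: 2 clips, 16 frames x 196 patches, 256-dim tokens -> 8 entity features.
entities = EntityGrouping()(torch.randn(2, 16 * 196, 256))  # (2, 8, 256)
```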
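The three alignment types listed above can be pictured as symmetric InfoNCE objectives over paired embeddings. The sketch below assumes one noun and one verb embedding per sample and a soft entity-grounding step; the function names (`info_nce`, `ground`, `multi_grained_loss`) and loss weights are hypothetical, not the paper's implementation.

```python
# A minimal sketch of the three alignment objectives, assuming symmetric
# InfoNCE losses and one noun/verb embedding per sample; all names here
# (info_nce, ground, multi_grained_loss) are hypothetical, not the paper's API.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: a[i] and b[i] are positives, the rest of the batch negatives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                          # (B, B) similarities
    targets = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def ground(query: torch.Tensor, entities: torch.Tensor) -> torch.Tensor:
    """Softly pools the entity slots most relevant to a text query (soft grounding)."""
    # query: (B, D), entities: (B, E, D)
    attn = torch.einsum('bd,bed->be', F.normalize(query, dim=-1),
                        F.normalize(entities, dim=-1)).softmax(dim=-1)
    return torch.einsum('be,bed->bd', attn, entities)

def multi_grained_loss(video_emb, narration_emb, entity_emb, noun_emb, verb_emb,
                       w_vn: float = 1.0, w_ne: float = 0.5, w_ve: float = 0.5):
    l_vn = info_nce(video_emb, narration_emb)                  # video-narration alignment
    l_ne = info_nce(noun_emb, ground(noun_emb, entity_emb))    # noun-entity alignment
    l_ve = info_nce(verb_emb, ground(verb_emb, entity_emb))    # verb-entity alignment
    return w_vn * l_vn + w_ne * l_ne + w_ve * l_ve

# Example with random embeddings: batch of 4, 8 entity slots, 256-dim features.
B, E, D = 4, 8, 256
loss = multi_grained_loss(torch.randn(B, D), torch.randn(B, D),
                          torch.randn(B, E, D), torch.randn(B, D), torch.randn(B, D))
```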