Current video-language models (VLMs) rely extensively on instance-level alignment between the video and language modalities, which has two major limitations: (1) visual reasoning does not follow the natural first-person perception of humans, resulting in limited interpretability of the reasoning process; and (2) the learned representations fail to capture the inherent fine-grained relationships between the two modalities. In this paper, we take inspiration from human perception and explore a compositional approach to egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism that explicitly assembles dynamically evolving scene entities over time and models their relationships to form the video representation. By exploiting this compositional structure, HENASY offers strong interpretability via visual grounding with free-form text queries. We further explore a suite of multi-grained contrastive losses to facilitate entity-centric understanding, comprising three alignment types: video-narration, noun-entity, and verb-entity alignments. Our method demonstrates strong interpretability in both quantitative and qualitative experiments, while maintaining competitive performance on five downstream tasks via zero-shot transfer or as a video/text representation, including video/text retrieval, action recognition, multiple-choice query, natural language query, and moments query.
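To make the entity-assembly idea mentioned above concrete, below is a minimal sketch of a slot-attention-style spatiotemporal token grouping step, assuming a standard PyTorch setup; the class name `EntityGrouping`, the dimensions, and the inverted-softmax choice are illustrative assumptions, not HENASY's actual architecture.

```python
# A minimal sketch of slot-attention-style spatiotemporal token grouping,
# assuming a PyTorch setup. Class name, dimensions, and the inverted
# softmax are illustrative assumptions, not HENASY's actual architecture.
import torch
import torch.nn as nn

class EntityGrouping(nn.Module):
    """Softly assigns video patch tokens to a fixed set of entity slots."""

    def __init__(self, dim: int = 256, num_entities: int = 8):
        super().__init__()
        # One learnable query per candidate scene entity (hypothetical).
        self.entity_queries = nn.Parameter(torch.randn(num_entities, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.scale = dim ** -0.5

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T*N, dim) -- patch tokens flattened over time and space,
        # so a slot can bind the same entity across frames as it evolves.
        batch = tokens.shape[0]
        q = self.to_q(self.entity_queries).expand(batch, -1, -1)  # (B, E, dim)
        k, v = self.to_kv(tokens).chunk(2, dim=-1)                # (B, T*N, dim) each
        attn = torch.einsum('bed,bnd->ben', q, k) * self.scale
        # Softmax over the entity axis: each token competes to join one slot.
        attn = attn.softmax(dim=1)
        # Renormalize per slot over tokens before aggregating token values.
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.einsum('ben,bnd->bed', attn, v)              # (B, E, dim)

# Example: 2 clips, 16 frames x 196 patches, 256-dim tokens -> 8 entity features.
entities = EntityGrouping()(torch.randn(2, 16 * 196, 256))  # (2, 8, 256)
```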
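The three alignment types listed above can be pictured as symmetric InfoNCE objectives over paired embeddings. The sketch below assumes one noun and one verb embedding per sample and a soft entity-grounding step; the function names (`info_nce`, `ground`, `multi_grained_loss`) and loss weights are hypothetical, not the paper's implementation.

```python
# A minimal sketch of the three alignment objectives, assuming symmetric
# InfoNCE losses and one noun/verb embedding per sample; all names here
# (info_nce, ground, multi_grained_loss) are hypothetical, not the paper's API.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: a[i] and b[i] are positives, the rest of the batch negatives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                          # (B, B) similarities
    targets = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def ground(query: torch.Tensor, entities: torch.Tensor) -> torch.Tensor:
    """Softly pools the entity slots most relevant to a text query (soft grounding)."""
    # query: (B, D), entities: (B, E, D)
    attn = torch.einsum('bd,bed->be', F.normalize(query, dim=-1),
                        F.normalize(entities, dim=-1)).softmax(dim=-1)
    return torch.einsum('be,bed->bd', attn, entities)

def multi_grained_loss(video_emb, narration_emb, entity_emb, noun_emb, verb_emb,
                       w_vn: float = 1.0, w_ne: float = 0.5, w_ve: float = 0.5):
    l_vn = info_nce(video_emb, narration_emb)                  # video-narration alignment
    l_ne = info_nce(noun_emb, ground(noun_emb, entity_emb))    # noun-entity alignment
    l_ve = info_nce(verb_emb, ground(verb_emb, entity_emb))    # verb-entity alignment
    return w_vn * l_vn + w_ne * l_ne + w_ve * l_ve

# Example with random embeddings: batch of 4, 8 entity slots, 256-dim features.
B, E, D = 4, 8, 256
loss = multi_grained_loss(torch.randn(B, D), torch.randn(B, D),
                          torch.randn(B, E, D), torch.randn(B, D), torch.randn(B, D))
```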