We propose a novel benchmark for cross-view knowledge transfer in dense video captioning, adapting models from exocentric web instructional videos to the egocentric view. While dense video captioning (predicting time segments and their captions) has primarily been studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos remain limited due to data scarcity. To overcome this limited video availability, transferring knowledge from abundant exocentric web videos is a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult because of their dynamic view changes: the web videos contain shots showing either full-body or hand regions, while the egocentric view shifts constantly. This necessitates an in-depth study of cross-view transfer under complex view changes. To this end, we first create a real-life egocentric dataset (EgoYC2) whose captions follow the definition of YouCook2 captions, enabling transfer learning between the two datasets with access to their ground truth. To bridge the view gaps, we propose a view-invariant learning method based on adversarial training, consisting of pre-training and fine-tuning stages. Our experiments confirm that the proposed method effectively overcomes the view change problem and transfers knowledge to egocentric views. Our benchmark extends the study of cross-view transfer to the new task domain of dense video captioning and envisions methodologies that describe egocentric videos in natural language.
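To make the adversarial view-invariant learning idea concrete, the sketch below shows one common way such training is set up: a gradient-reversal layer feeds a binary view discriminator (egocentric vs. exocentric) so that the shared encoder is pushed toward view-invariant features. This is a minimal illustration under assumed module names, feature dimensions, and a stand-in captioning head; it is not the paper's actual architecture or code.

```python
# Minimal sketch (not the paper's implementation) of adversarial view-invariant
# learning with a gradient-reversal layer. All names and shapes are hypothetical.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None


class ViewInvariantModel(nn.Module):
    def __init__(self, feat_dim=512, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        # Shared encoder over per-clip features (e.g., pre-extracted visual features).
        self.encoder = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        # Stand-in for the dense captioning decoder.
        self.caption_head = nn.Linear(feat_dim, 1000)  # hypothetical vocabulary size
        # View discriminator: predicts egocentric (1) vs. exocentric (0).
        self.view_disc = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                       nn.Linear(128, 1))

    def forward(self, clip_feats):
        z = self.encoder(clip_feats)
        caption_logits = self.caption_head(z)
        # Gradient reversal: the discriminator learns to separate views, while the
        # encoder receives reversed gradients that make views indistinguishable.
        view_logits = self.view_disc(GradReverse.apply(z, self.lamb))
        return caption_logits, view_logits


# Usage sketch on a mixed batch of egocentric and exocentric clip features.
model = ViewInvariantModel()
feats = torch.randn(8, 512)
view_labels = torch.randint(0, 2, (8, 1)).float()  # 1 = ego, 0 = exo
_, view_logits = model(feats)
adv_loss = nn.functional.binary_cross_entropy_with_logits(view_logits, view_labels)
adv_loss.backward()  # encoder gradients are reversed; discriminator gradients are not
```

In a two-stage scheme like the one described above, such an adversarial objective would typically be combined with the captioning loss during pre-training on exocentric data and retained or relaxed during fine-tuning on the egocentric dataset; the exact schedule here is an assumption for illustration.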