Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

We propose a novel benchmark for cross-view knowledge transfer in dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) has primarily been studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos remain limited due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes: the web videos contain mixed views focusing on either human body actions or close-up hand-object interactions, while the egocentric view shifts constantly as the camera wearer moves. This necessitates an in-depth study of cross-view transfer under complex view changes. In this work, we first create a real-life egocentric dataset (EgoYC2) whose captions are shared with YouCook2, enabling transfer learning between the two datasets when ground truth is accessible for both. To bridge the view gaps, we propose a view-invariant learning method using adversarial training in both the pre-training and fine-tuning stages. While the pre-training is designed to learn features that are invariant to the mixed views in the web videos, the view-invariant fine-tuning further mitigates the view gap between the two datasets. We validate our proposed method by studying how effectively it overcomes the view change problem and how efficiently it transfers knowledge to the egocentric domain. Our benchmark extends the study of cross-view transfer to the new task domain of dense video captioning and envisions methodologies for describing egocentric videos in natural language.
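The abstract does not say how the adversarial training is implemented. As a minimal sketch only, assuming a gradient-reversal formulation (a common way to realize adversarial feature learning, not confirmed as the authors' choice), the idea of view-invariant features can be illustrated as follows. The names `GradientReversal`, `encoder_gradient`, and the weight `lam` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass.

    Assumption: this stands in for the adversarial component, which the
    abstract does not specify.
    """

    lam: float  # hypothetical weight of the adversarial signal

    def forward(self, features: List[float]) -> List[float]:
        # Features reach the view classifier unchanged.
        return list(features)

    def backward(self, grad_from_view_classifier: List[float]) -> List[float]:
        # Flip the sign so the feature encoder is pushed to make views
        # harder to distinguish, while the view classifier itself is
        # still trained to distinguish them.
        return [-self.lam * g for g in grad_from_view_classifier]


def encoder_gradient(
    grad_caption: List[float],
    grad_view: List[float],
    grl: GradientReversal,
) -> List[float]:
    """Combine the captioning gradient with the reversed view gradient."""
    reversed_view = grl.backward(grad_view)
    return [gc + gv for gc, gv in zip(grad_caption, reversed_view)]


grl = GradientReversal(lam=0.5)
features = [1.0, -2.0]
combined = encoder_gradient([0.25, 0.5], [1.0, -1.0], grl)
```

Under this assumption, the same mechanism could be applied twice: once with a classifier over the mixed views within the web videos (pre-training), and once with a classifier separating exocentric from egocentric data (fine-tuning).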
