Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

We propose a novel benchmark for cross-view knowledge transfer in dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are limited by data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult because the views change dynamically in both. The web videos contain mixed views focusing on either human body actions or close-up hand-object interactions, while the egocentric view shifts constantly as the camera wearer moves. This calls for an in-depth study of cross-view transfer under complex view changes. In this work, we first create a real-life egocentric dataset (EgoYC2) whose captions are shared with YouCook2, enabling transfer learning between these datasets under the assumption that their ground truth is accessible. To bridge the view gaps, we propose a view-invariant learning method that uses adversarial training in both the pre-training and fine-tuning stages. While the pre-training is designed to learn features invariant to the mixed views in the web videos, the view-invariant fine-tuning further mitigates the view gap between the two datasets. We validate the proposed method by studying how effectively it overcomes the view change problem and how efficiently it transfers knowledge to the egocentric domain. Our benchmark extends the study of cross-view transfer to the new task domain of dense video captioning and is intended to encourage methodologies for describing egocentric videos in natural language.
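The abstract does not say how the adversarial objective is implemented. As a minimal illustrative sketch, the following assumes gradient reversal, a common technique in domain-adversarial training, applied to a toy scalar model: a view classifier is trained to predict the view from the feature, while the feature extractor receives the sign-flipped gradient so that its features become less informative about the view. All names here (`adversarial_grads`, `w`, `v`, `b`, `lam`, and the 0/1 view labels) are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def adversarial_grads(x, view_label, w, v, b, lam=1.0):
    """Return (loss, grad_w, grad_v, grad_b) for one scalar sample.

    x          : scalar input (stand-in for a video feature)
    view_label : 0 or 1 (hypothetical view tag, e.g. 0 = exo, 1 = ego)
    w          : weight of the toy feature extractor, f = w * x
    v, b       : weight and bias of the toy view classifier
    lam        : strength of the reversed gradient (assumed hyperparameter)
    """
    f = w * x                    # extracted feature
    p = sigmoid(v * f + b)       # predicted probability of view 1
    eps = 1e-12                  # numerical guard for log
    loss = -(view_label * math.log(p + eps)
             + (1 - view_label) * math.log(1 - p + eps))

    dz = p - view_label          # d(loss) / d(logit) for binary cross-entropy
    grad_v = dz * f              # classifier gradient: descends the view loss
    grad_b = dz
    # Gradient reversal: the extractor gets the negated, scaled gradient,
    # so a descent step on grad_w increases the view-classification loss.
    grad_w = -lam * dz * v * x
    return loss, grad_w, grad_v, grad_b
```

In a full model, a reversed term of this kind would be combined with the captioning loss, so the extractor would be pushed toward features that remain useful for captioning while carrying less view information; `lam` would control that trade-off.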
