Egocentric and exocentric views of a human activity look dramatically different, yet view-invariant representations that link them are essential for many potential applications in robotics and augmented reality. Prior work is limited to learning view-invariant features from paired, synchronized viewpoints. We relax that strong data assumption and propose to learn fine-grained action features that are invariant to viewpoint by aligning egocentric and exocentric videos in time, even when they are not captured simultaneously or in the same environment. To this end, we propose AE2, a self-supervised embedding approach with two key designs: (1) an object-centric encoder that explicitly focuses on regions corresponding to hands and active objects; and (2) a contrastive alignment objective that uses temporally reversed frames as negative samples. For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context, comprising four datasets, including an ego tennis forehand dataset we collected, along with dense per-frame labels we annotated for each dataset. On the four datasets, our AE2 method strongly outperforms prior work in a variety of fine-grained downstream tasks, in both regular and cross-view settings.
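To make the second design concrete, the following is a minimal sketch of how temporally reversed frames can act as negatives in an alignment objective. It is an illustration under stated assumptions, not the AE2 implementation: the abstract does not specify the alignment cost or the loss form, so the dynamic time warping (DTW) cost, the hinge formulation, the `margin` parameter, and the function names `dtw_cost` and `reversed_negative_loss` are all hypothetical choices made here. Per-frame embeddings are represented as plain lists of floats.

```python
from typing import Sequence

Frame = Sequence[float]


def frame_distance(x: Frame, y: Frame) -> float:
    """Squared Euclidean distance between two frame embeddings."""
    return sum((p - q) ** 2 for p, q in zip(x, y))


def dtw_cost(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Classic DTW alignment cost between two embedding sequences.

    Assumption: DTW stands in for whatever temporal alignment cost
    the method actually uses; the abstract does not name one.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning a[:i] with b[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = frame_distance(a[i - 1], b[j - 1]) + best_prev
    return acc[n][m]


def reversed_negative_loss(
    ego: Sequence[Frame], exo: Sequence[Frame], margin: float = 1.0
) -> float:
    """Hinge loss (hypothetical form): aligning ego with the forward exo
    video should cost less, by at least `margin`, than aligning ego with
    the temporally reversed exo video."""
    positive = dtw_cost(ego, exo)
    negative = dtw_cost(ego, list(reversed(exo)))
    return max(0.0, positive - negative + margin)


# Toy 1-D "embeddings": both videos progress through the same action,
# but at different speeds and with different lengths (unsynchronized).
ego = [[0.0], [1.0], [2.0], [3.0], [4.0]]
exo = [[0.0], [0.0], [1.0], [2.0], [3.0], [3.0], [4.0]]

print(reversed_negative_loss(ego, exo))
```

In this toy case the forward pair warps onto each other at no cost, while the reversed pair cannot, so the loss vanishes. In a learned setting the embeddings would come from an encoder, and the gradient of such a loss would push it to produce features whose temporal order is consistent across views.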