We consider the problem of transferring a temporal action segmentation system initially designed for exocentric (fixed) cameras to an egocentric scenario, in which wearable cameras capture the video data. The conventional supervised approach requires collecting and labeling a new set of egocentric videos to adapt the model, which is costly and time-consuming. Instead, we propose a novel methodology that performs the adaptation by leveraging existing labeled exocentric videos and a new set of unlabeled, synchronized exocentric-egocentric video pairs, for which temporal action segmentation annotations do not need to be collected. We implement the proposed methodology with an approach based on knowledge distillation, which we investigate both at the feature level and at the temporal action segmentation model level. Experiments on Assembly101 and EgoExo4D demonstrate the effectiveness of the proposed method against classic unsupervised domain adaptation and temporal alignment approaches. Without bells and whistles, our best model performs on par with supervised approaches trained on labeled egocentric data, without ever seeing a single egocentric label, achieving a +15.99 improvement in edit score (28.59 vs. 12.60) on the Assembly101 dataset compared to a baseline model trained solely on exocentric data. In similar settings, our method also improves edit score by +3.32 on the challenging EgoExo4D benchmark. Code is available here: https://github.com/fpv-iplab/synchronization-is-all-you-need.
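To make the feature-level distillation idea concrete, below is a minimal, hedged sketch that is *not* the paper's implementation: a frozen "teacher" produces features from the exocentric view, and a student projection over egocentric inputs is fitted by gradient descent on an MSE loss over time-synchronized frame pairs. No action labels are used. The synthetic data, the linear student, and all names (`teacher_feats`, `ego_inputs`, `W`) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_EGO, D = 32, 6, 4  # 32 synchronized frames; ego input dim 6, feature dim 4

# Hypothetical setup: ego inputs carry the same information as the exo view,
# so the frozen teacher's exo features are a (noisy) linear function of them.
ego_inputs = rng.standard_normal((T, D_EGO))
W_true = rng.standard_normal((D_EGO, D))
teacher_feats = ego_inputs @ W_true + 0.01 * rng.standard_normal((T, D))

def mse(a, b):
    """Distillation loss: mean squared error between synchronized features."""
    return float(np.mean((a - b) ** 2))

# Student: a learnable projection of ego inputs into the teacher's feature space.
W = np.zeros((D_EGO, D))
initial_loss = mse(ego_inputs @ W, teacher_feats)

lr = 0.05
for _ in range(500):
    pred = ego_inputs @ W
    # Gradient of mse(X @ W, Y) w.r.t. W: 2 * X^T (X W - Y) / (T * D)
    grad = 2.0 * ego_inputs.T @ (pred - teacher_feats) / (T * D)
    W -= lr * grad

final_loss = mse(ego_inputs @ W, teacher_feats)
```

In practice the student would be a deep video encoder rather than a linear map, but the supervision signal is the same: synchronized exo-ego pairs stand in for labels.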