We consider the problem of transferring a temporal action segmentation system initially designed for exocentric (fixed) cameras to an egocentric scenario, where video is captured by wearable cameras. The conventional supervised approach requires collecting and labeling a new set of egocentric videos to adapt the model, which is costly and time-consuming. Instead, we propose a novel methodology that performs the adaptation using existing labeled exocentric videos and a new set of unlabeled, synchronized exocentric-egocentric video pairs, for which temporal action segmentation annotations do not need to be collected. We implement the proposed methodology with an approach based on knowledge distillation, which we investigate at both the feature level and the model level. To evaluate our approach, we introduce a new benchmark based on the Assembly101 dataset. Results demonstrate the feasibility and effectiveness of the proposed method against classic unsupervised domain adaptation and temporal sequence alignment approaches. Remarkably, without bells and whistles, our best model performs on par with supervised approaches trained on labeled egocentric data, without ever seeing a single egocentric label, achieving an improvement of +15.99 percentage points (28.59% vs. 12.60%) in the edit score on the Assembly101 dataset compared to a baseline model trained solely on exocentric data.
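For readers unfamiliar with the edit score reported above, the following is a minimal sketch of how this metric is commonly computed in temporal action segmentation: per-frame labels are collapsed into an ordered sequence of segment labels, and the score is a normalized Levenshtein distance between the predicted and ground-truth sequences. The function names (`to_segments`, `levenshtein`, `edit_score`) are illustrative assumptions and are not taken from the authors' code; the benchmark's exact evaluation protocol may differ in details.

```python
def to_segments(frame_labels):
    """Collapse per-frame labels into the ordered list of segment labels."""
    segments = []
    for label in frame_labels:
        if not segments or segments[-1] != label:
            segments.append(label)
    return segments


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_score(pred_frames, gt_frames):
    """Segmental edit score in [0, 100]; higher is better."""
    pred, gt = to_segments(pred_frames), to_segments(gt_frames)
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 100.0  # assumption: two empty sequences count as a perfect match
    return 100.0 * (1.0 - levenshtein(pred, gt) / longest)


# Hypothetical toy labels: the prediction over-segments the ground truth.
print(edit_score(["a", "b", "a", "b"], ["a", "b"]))
```

Because the score depends only on the order of segments and not on their exact boundaries, it penalizes over-segmentation rather than small timing shifts. The reported gain is the difference between the two scores, 28.59 - 12.60 = 15.99 points.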