Different video understanding tasks are typically treated in isolation, often with distinct types of curated data for each (e.g., classifying sports in one dataset, tracking animals in another). However, with wearable cameras, the immersive egocentric perspective of a person engaging with the world around them presents an interconnected web of video understanding tasks -- hand-object manipulations, navigation in the space, or human-human interactions -- that unfold continuously, driven by the person's goals. We argue that this calls for a much more unified approach. We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once. Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition. Demonstrating our model on a wide array of video tasks from Ego4D, we show its advantages over existing transfer paradigms and achieve top-ranked results on four of the Ego4D 2022 benchmark challenges.