Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person view. Likewise, machines can move closer to human intelligence by learning from multisensory inputs captured from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion is common in first-person recordings, even within short durations; and 2) out-of-view sound components can arise as wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. To tackle the second issue, we propose a cascaded feature enhancement module, which improves cross-modal localization robustness by disentangling visually indicated audio representations. During training, we take advantage of the naturally available audio-visual temporal synchronization as "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and generalizes to diverse audio-visual scenes.
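To make the idea of geometry-aware temporal aggregation concrete, the following is a minimal sketch and not the method described above. It assumes that the temporal geometry transformation is a 3x3 homography between frames, that the homography is already known, and that features are aligned by nearest-neighbour backward warping and then averaged. The function names `apply_homography`, `warp_feature_map` and `aggregate` are hypothetical and exist only for this illustration.

```python
# Illustrative sketch only. Assumptions (not taken from the abstract):
# the geometry transformation is a 3x3 homography, it is given rather than
# estimated, and aligned features are fused by simple averaging.

def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 homography H given as nested lists."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w


def warp_feature_map(feat, H):
    """Backward-warp a 2D feature map into the reference frame.

    H maps reference-frame coordinates to source-frame coordinates.
    Returns (warped, valid); valid marks cells whose source location
    fell inside the source map.
    """
    h, w = len(feat), len(feat[0])
    warped = [[0.0] * w for _ in range(h)]
    valid = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = apply_homography(H, x, y)
            ix, iy = int(round(sx)), int(round(sy))  # nearest neighbour
            if 0 <= ix < w and 0 <= iy < h:
                warped[y][x] = feat[iy][ix]
                valid[y][x] = True
    return warped, valid


def aggregate(ref_feat, neighbours):
    """Average the reference map with aligned neighbour maps.

    neighbours is a list of (feature_map, H) pairs. Cells that a neighbour
    cannot cover after warping are skipped for that neighbour.
    """
    h, w = len(ref_feat), len(ref_feat[0])
    total = [row[:] for row in ref_feat]
    count = [[1] * w for _ in range(h)]
    for feat, H in neighbours:
        warped, valid = warp_feature_map(feat, H)
        for y in range(h):
            for x in range(w):
                if valid[y][x]:
                    total[y][x] += warped[y][x]
                    count[y][x] += 1
    return [[total[y][x] / count[y][x] for x in range(w)] for y in range(h)]


# Toy example: the same object response appears one pixel to the right in
# the neighbouring frame because the camera moved.
ref = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
nbr = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
shift_right = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]  # reference x -> source x + 1
fused = aggregate(ref, [(nbr, shift_right)])
print(fused[1])  # [0.0, 1.0, 0.0]
```

In this toy case, averaging the two frames without alignment would smear the response across two cells (0.5 each), whereas warping first keeps it sharp at the object's location in the reference frame. A practical system would estimate the transformation from the video and operate on learned multi-channel features.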