Predicting where a person is looking is a complex task. It requires understanding not only the person's gaze and the scene content, but also the 3D scene structure and the person's situation (are they manipulating an object? interacting with or observing others? attentive?), in order to detect obstructions in the line of sight or to apply the attention priors that humans typically rely on when observing others. In this paper, we hypothesize that such priors can be identified and leveraged more effectively by exploiting explicitly derived multimodal cues such as depth and pose. We therefore propose a modular multimodal architecture that combines these cues using an attention mechanism. The architecture can naturally be used in privacy-sensitive settings such as surveillance and health, where personally identifiable information cannot be released. We perform extensive experiments on the public GazeFollow and VideoAttentionTarget datasets, obtaining state-of-the-art performance and demonstrating very competitive results in the privacy-sensitive setting.
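To make the idea of combining cues with an attention mechanism concrete, the following is a minimal sketch, not the architecture proposed in the paper. It assumes each modality has already been encoded into a fixed-size embedding, scores each embedding with a single vector, and fuses them with softmax weights. The names `fuse_modalities` and `score_weights`, the embedding size, and the random inputs are all hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / np.sum(exp)


def fuse_modalities(features, score_weights):
    """Attention-weighted fusion of per-modality embeddings.

    features: dict mapping a modality name to a (d,) embedding (assumed
        to come from separate, hypothetical encoders).
    score_weights: (d,) vector used to score each modality; in a real
        model this would be learned, here it is just a placeholder.
    Returns the fused (d,) embedding and the per-modality weights.
    """
    names = sorted(features)
    stacked = np.stack([features[n] for n in names])  # shape (m, d)
    scores = stacked @ score_weights                   # one score per modality
    weights = softmax(scores)                          # non-negative, sums to 1
    fused = weights @ stacked                          # weighted sum, shape (d,)
    return fused, dict(zip(names, weights))


# Illustrative inputs: random embeddings stand in for encoder outputs.
rng = np.random.default_rng(0)
d = 8
features = {name: rng.normal(size=d) for name in ("image", "depth", "pose")}
score_weights = rng.normal(size=d)

fused, weights = fuse_modalities(features, score_weights)

# The same function accepts any subset of modalities, for example
# depth and pose only (an assumed reading of the privacy-sensitive case).
private = {k: v for k, v in features.items() if k != "image"}
fused_private, weights_private = fuse_modalities(private, score_weights)
```

Because the fusion operates on whichever embeddings it is given, a modality can be omitted without changing the function. Whether the paper handles the privacy-sensitive setting this way is an assumption of this sketch.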