Gaze object prediction is a newly proposed task that aims to discover the objects that humans are looking at. It has significant practical applications but still lacks a unified solution framework. An intuitive solution is to add an object detection branch to an existing gaze prediction method. However, previous gaze prediction methods usually use two separate networks to extract features from the scene image and the head image, which leads to a heavy network architecture and prevents the branches from being jointly optimized. In this paper, we build a novel framework named GaTector to tackle the gaze object prediction problem in a unified way. In particular, we first propose a specific-general-specific (SGS) feature extractor that uses a shared backbone to extract general features for both scene and head images. To account for the specificity of inputs and tasks, SGS introduces two input-specific blocks before the shared backbone and three task-specific blocks after it. Specifically, a novel Defocus layer is designed to generate object-specific features for the object detection task without losing information or requiring extra computation. Moreover, an energy aggregation loss is introduced to guide the gaze heatmap to concentrate on the box of the stared object. Finally, we propose a novel wUoC metric that can reveal the difference between boxes even when they share no overlapping area. Extensive experiments on the GOO dataset verify the superiority of our method in all three tracks, i.e., object detection, gaze estimation, and gaze object prediction.
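To make the idea of the energy aggregation loss more concrete, the sketch below shows one plausible form of it. The abstract does not give the formula, so everything here is an assumption for illustration: the function name `energy_aggregation_loss`, the box convention `(x1, y1, x2, y2)` with exclusive upper bounds, and the choice to define the loss as one minus the fraction of heatmap energy that falls inside the stared box. The actual loss used in GaTector may be normalized differently.

```python
def energy_aggregation_loss(heatmap, box):
    """Illustrative loss: 1 - (heatmap energy inside box / total energy).

    heatmap: 2D list of non-negative floats, indexed as heatmap[y][x].
    box: (x1, y1, x2, y2) in heatmap coordinates; x2 and y2 are exclusive.
    This is an assumed formulation, not the paper's exact definition.
    """
    x1, y1, x2, y2 = box
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        # No energy anywhere: treat the heatmap as maximally unconcentrated.
        return 1.0
    inside = sum(sum(row[x1:x2]) for row in heatmap[y1:y2])
    return 1.0 - inside / total


# Usage: a uniform 4x4 heatmap and a 2x2 box in the top-left corner.
uniform = [[1.0] * 4 for _ in range(4)]
loss_uniform = energy_aggregation_loss(uniform, (0, 0, 2, 2))

# A heatmap whose energy lies entirely inside the same box.
focused = [[1.0, 1.0, 0.0, 0.0],
           [1.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
loss_focused = energy_aggregation_loss(focused, (0, 0, 2, 2))
```

Under this assumed form, the loss decreases as more of the heatmap's energy lies inside the box, which matches the stated goal of guiding the gaze heatmap to concentrate on the stared box.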