We propose an end-to-end approach for gaze target detection: predicting head-target connections between individuals and the image regions they are looking at. Most existing methods rely on independent components such as off-the-shelf head detectors, or struggle to establish associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on the input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as an explicit visual association between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.