Gaze object prediction (GOP) aims to predict the location and category of the object that a human is looking at. Previous GOP methods use CNN-based object detectors to localize objects. However, we find that Transformer-based object detectors localize dense objects in retail scenarios more accurately. Moreover, the long-distance modeling capability of the Transformer helps build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces the Transformer into the field of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to locate objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. To improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism that lets the queries of the gaze autoencoder learn global-memory position knowledge from the object detector. Finally, to train the whole framework end to end, we propose a Gaze Box loss that jointly optimizes the object detector and the gaze regressor by enhancing the gaze heatmap energy inside the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at https://github.com/chenxi-Guo/TransGOP.git.
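To make the idea behind the Gaze Box loss concrete, the following is a minimal sketch of a loss that rewards heatmap energy falling inside the gaze object's box. The abstract does not give the exact formulation, so the normalized form used here (one minus the fraction of total heatmap energy inside the box), the function name `gaze_box_energy_loss`, and the box convention are all assumptions for illustration, not the authors' definition.

```python
from typing import Sequence, Tuple


def gaze_box_energy_loss(
    heatmap: Sequence[Sequence[float]],
    box: Tuple[int, int, int, int],
) -> float:
    """Hypothetical box-energy loss (an assumption, not the paper's formula).

    heatmap: 2D grid of non-negative gaze heatmap values, indexed [row][col].
    box: (x1, y1, x2, y2) in heatmap cells, with x2 and y2 exclusive.
    Returns 1 - (energy inside the box / total energy), a value in [0, 1].
    """
    x1, y1, x2, y2 = box
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        # No energy anywhere: treat as the worst case.
        return 1.0
    inside = sum(sum(row[x1:x2]) for row in heatmap[y1:y2])
    return 1.0 - inside / total


# A uniform 4x4 heatmap puts 4 of its 16 units of energy in a 2x2 box.
uniform = [[1.0] * 4 for _ in range(4)]
print(gaze_box_energy_loss(uniform, (0, 0, 2, 2)))  # 0.75

# A heatmap peaked entirely inside the box gives zero loss.
peaked = [[0.0] * 4 for _ in range(4)]
peaked[1][1] = 1.0
print(gaze_box_energy_loss(peaked, (0, 0, 2, 2)))  # 0.0
```

Under this assumed form, the loss shrinks as the predicted heatmap concentrates inside the gaze object's box. In a real training setup the same computation would be written with differentiable tensor operations so that gradients can reach both the gaze regressor and the object detector.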