TransGOP: Transformer-Based Gaze Object Prediction

Gaze object prediction (GOP) aims to predict the location and category of the object that a person is looking at. Previous GOP methods use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors predict more accurate locations for densely packed objects in retail scenarios. Moreover, the Transformer's long-distance modeling capability helps build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces the Transformer into the field of gaze object prediction and proposes an end-to-end Transformer-based method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to locate objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. To improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism that lets the queries of the gaze autoencoder learn global-memory position knowledge from the object detector. Finally, to make the whole framework trainable end to end, we propose a Gaze Box loss that jointly optimizes the object detector and the gaze regressor by enhancing the gaze heatmap energy inside the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at https://github.com/chenxi-Guo/TransGOP.git.
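
The abstract describes the Gaze Box loss only as "enhancing the gaze heatmap energy in the box of the gaze object" and gives no formula. The sketch below is one plausible reading of that idea, not the authors' implementation: it measures the fraction of heatmap energy that falls inside a box and penalizes the remainder. The function names `box_energy_ratio` and `gaze_box_loss`, the half-open box convention, and the `1 - ratio` form are all assumptions made for illustration.

```python
def box_energy_ratio(heatmap, box):
    """Fraction of the total heatmap energy that lies inside `box`.

    heatmap: list of rows, each a list of non-negative floats.
    box: (x1, y1, x2, y2) in pixel indices, half-open (x1 <= x < x2).
    NOTE: hypothetical helper; the box convention is an assumption.
    """
    x1, y1, x2, y2 = box
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        # An all-zero heatmap carries no energy anywhere.
        return 0.0
    inside = sum(sum(row[x1:x2]) for row in heatmap[y1:y2])
    return inside / total


def gaze_box_loss(heatmap, box):
    """Assumed loss form: small when most energy lies inside the box."""
    return 1.0 - box_energy_ratio(heatmap, box)


# Toy 4x4 heatmap with all of its energy in the central 2x2 block.
heatmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
well_placed_box = (1, 1, 3, 3)   # covers the whole energy blob
misplaced_box = (0, 0, 2, 2)     # covers one of the four hot pixels
print(gaze_box_loss(heatmap, well_placed_box))
print(gaze_box_loss(heatmap, misplaced_box))
```

In a trainable version, the same ratio would be computed on tensors so that gradients reach both the heatmap and, depending on the design, the detector's box prediction; how TransGOP couples the two is not specified in the abstract.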
