Predicting human gaze is important in Human-Computer Interaction (HCI). However, to practically serve HCI applications, gaze prediction models must be scalable, fast, and accurate in both their spatial and temporal gaze predictions. Recent scanpath prediction models focus on goal-directed attention (search). Such models are limited in their application because they commonly rely on trained target detectors for all possible objects and on the availability of human gaze data for training, neither of which is scalable. In response, we pose a new task called ZeroGaze, a new variant of zero-shot learning in which gaze is predicted for never-before-searched objects, and we develop a novel model, Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods that use object detector modules, Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction. We use a transformer-based encoder-decoder architecture because transformers are particularly useful for generating contextual representations. Gazeformer surpasses other models by a large margin in the ZeroGaze setting. It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks. In addition to its improved performance, Gazeformer is more than five times faster than the state-of-the-art target-present visual search model.