Predicting human gaze is important in Human-Computer Interaction (HCI). To serve HCI applications in practice, however, gaze prediction models must be scalable, fast, and accurate in both their spatial and temporal gaze predictions. Recent scanpath prediction models focus on goal-directed attention (visual search). Their applicability is limited because they commonly rely on trained target detectors for every possible object and on human gaze data for training, neither of which scales. In response, we pose a new task called ZeroGaze, a variant of zero-shot learning in which gaze is predicted for never-before-searched objects, and we develop a novel model, Gazeformer, to solve it. In contrast to existing methods that use object detector modules, Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction. We use a transformer-based encoder-decoder architecture because transformers are particularly useful for generating contextual representations. Gazeformer surpasses other models by a large margin in the ZeroGaze setting. It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks. In addition to its improved performance, Gazeformer is more than five times faster than the state-of-the-art target-present visual search model.
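To illustrate why encoding the target with a language model may help with never-before-searched objects, the following is a minimal sketch and not Gazeformer's implementation. It assumes hypothetical, hand-made 3-dimensional embeddings (`TOY_EMBEDDINGS`) and a hypothetical helper `nearest_seen_target`. Real language-model embeddings are much larger and learned from text. The point is only that an unseen target name can land close to semantically related targets that were seen during training.

```python
import math

# Hypothetical 3-d "language embeddings", made up for illustration only.
# Real language-model embeddings are learned and have far more dimensions.
TOY_EMBEDDINGS = {
    "knife": [0.9, 0.1, 0.0],  # assumed seen during training
    "car": [0.0, 0.2, 0.9],    # assumed seen during training
    "fork": [0.8, 0.2, 0.1],   # assumed never searched during training
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def nearest_seen_target(unseen, seen_targets, embeddings=TOY_EMBEDDINGS):
    """Return the seen target whose embedding is most similar to the unseen one."""
    return max(
        seen_targets,
        key=lambda name: cosine_similarity(embeddings[unseen], embeddings[name]),
    )


if __name__ == "__main__":
    # With these toy vectors, "fork" sits closer to "knife" than to "car".
    print(nearest_seen_target("fork", ["knife", "car"]))
```

A detector-based pipeline would need a new trained detector for "fork", whereas an embedding-based target encoding gives a model a chance to reuse what it learned about nearby targets.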