Gaze following aims to interpret human-scene interactions by predicting a person's gaze target. Prevailing approaches adopt a two-stage framework in which multi-modality information is extracted in the first stage for gaze target prediction; their efficacy therefore depends heavily on the precision of that preceding modality extraction. Other approaches use a single modality with complex decoders, which increases the network's computational load. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce ViTGaze, a novel single-modality gaze following framework. In contrast to previous methods, it builds the framework mainly on a powerful encoder, with the decoder accounting for less than 1% of the parameters. Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Building on this insight, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that ViTs with self-supervised pre-training exhibit an enhanced ability to extract this correlation information. Extensive experiments demonstrate the effectiveness of the proposed method: it achieves state-of-the-art (SOTA) performance among single-modality methods (a 3.4% improvement in the area under curve (AUC) score and a 5.1% improvement in average precision (AP)), and performance comparable to multi-modality methods with 59% fewer parameters.
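The core idea of reading human-scene interaction out of self-attention can be sketched with plain NumPy: per-head token-to-token attention over an H x W patch grid is an (H*W) x (H*W) matrix, which reshapes into a 4D tensor (query row, query col, key row, key col); slicing it at the person's head location yields a coarse map of where that region attends. This is a minimal illustrative sketch, not the paper's implementation; the function names, the 14x14 grid, and the random "attention" weights are all assumptions for demonstration.

```python
import numpy as np

def attention_to_4d(attn, grid):
    """Reshape one head's token-to-token attention (N x N, N = H*W)
    into a 4D interaction tensor (H, W, H, W): attention from every
    spatial query location to every key location."""
    h, w = grid
    return attn.reshape(h, w, h, w)

def gaze_heatmap(attn4d, head_rc):
    """Illustrative 2D spatial guidance: slice the 4D tensor at the
    head-patch location to get a map of where that patch attends."""
    r, c = head_rc
    hm = attn4d[r, c]               # (H, W) attention from the head patch
    return hm / (hm.sum() + 1e-8)   # renormalize to a distribution

# Toy example: 14x14 patch grid -> 196 tokens, one attention head.
rng = np.random.default_rng(0)
h = w = 14
attn = rng.random((h * w, h * w))
attn = attn / attn.sum(axis=-1, keepdims=True)  # row-stochastic, like softmax

attn4d = attention_to_4d(attn, (h, w))
hm = gaze_heatmap(attn4d, (3, 5))   # head patch at row 3, col 5
print(hm.shape)                     # (14, 14)
```

In the actual framework, such attention maps come from a pre-trained ViT's self-attention layers and are decoded by lightweight heads, which is why the decoder can stay under 1% of the total parameters.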