Gaze following aims to interpret human-scene interactions by predicting the person's focal point of gaze. Prevailing approaches often rely on multi-modality inputs, and most adopt a two-stage framework, so their performance depends heavily on the accuracy of the preceding prediction stage. Others use a single-modality approach with complex decoders, which increases the computational load of the network. Inspired by the remarkable success of pre-trained plain Vision Transformers (ViTs), we introduce a novel single-modality gaze following framework, ViTGaze. In contrast to previous methods, ViTGaze builds a brand new gaze following framework based mainly on a powerful encoder, with decoder parameters accounting for less than 1% of the total. Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Building on this insight, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that a ViT with self-supervised pre-training exhibits an enhanced ability to extract correlation information. Extensive experiments demonstrate the performance of the proposed method. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (3.4% improvement in AUC, 5.1% improvement in AP) and highly comparable performance to multi-modality methods while using 59% fewer parameters.
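To make the core idea concrete, below is a minimal PyTorch sketch of gaze prediction driven by ViT self-attention maps and a 2D head-position prior. The module name `AttentionGazeHead`, the tensor shapes, the pooling-based fusion, and the lightweight convolutional decoder are illustrative assumptions for exposition, not the paper's exact architecture.

```python
# Minimal sketch: gaze heatmap prediction from ViT self-attention (4D token-token
# interactions) gated by a 2D head-position prior. Sizes and fusion scheme are
# assumptions, not the ViTGaze implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGazeHead(nn.Module):
    def __init__(self, embed_dim=384, num_heads=6, grid=14):
        super().__init__()
        self.grid = grid
        self.num_heads = num_heads
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)  # stand-in for a ViT block's qkv projection
        # Lightweight decoder: the heavy lifting stays in the (pre-trained) encoder.
        self.decoder = nn.Conv2d(num_heads, 1, kernel_size=3, padding=1)

    def forward(self, tokens, head_mask):
        # tokens:    (B, N, C) patch tokens from a pre-trained ViT encoder
        # head_mask: (B, grid, grid) soft map marking the person's head location
        B, N, C = tokens.shape
        q, k, _ = self.qkv(tokens).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, -1).transpose(1, 2)        # (B, H, N, d)
        k = k.view(B, N, self.num_heads, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5       # (B, H, N, N): 4D interactions
        attn = attn.softmax(dim=-1)

        # 2D spatial guidance: pool attention rows emitted by tokens under the head mask.
        w = head_mask.flatten(1)                                    # (B, N)
        w = w / w.sum(dim=1, keepdim=True).clamp(min=1e-6)
        person_attn = torch.einsum('bn,bhnm->bhm', w, attn)         # (B, H, N)
        person_attn = person_attn.view(B, self.num_heads, self.grid, self.grid)

        heatmap = self.decoder(person_attn)                         # (B, 1, grid, grid)
        return F.interpolate(heatmap, scale_factor=16,
                             mode='bilinear', align_corners=False)  # back to image resolution

# Toy usage with random tokens standing in for ViT features.
tokens = torch.randn(2, 14 * 14, 384)
head_mask = torch.zeros(2, 14, 14)
head_mask[:, 3, 7] = 1.0                                            # assumed head position
print(AttentionGazeHead()(tokens, head_mask).shape)                 # torch.Size([2, 1, 224, 224])
```

The design choice this sketch illustrates is the one stated in the abstract: interaction reasoning is read out of the encoder's attention maps rather than learned by a heavy decoder, so the trainable head stays very small.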