Over the past decade, visual gaze estimation has attracted increasing attention within the research community, owing to its wide range of application scenarios. While existing estimation approaches have substantially improved prediction accuracy, they primarily infer gaze from single-image signals and neglect the potential benefits of text guidance, which is now prevalent in other visual tasks. Notably, visual-language collaboration has been extensively explored across various visual tasks, such as image synthesis and manipulation, leveraging the strong transferability of the large-scale Contrastive Language-Image Pre-training (CLIP) model. Nevertheless, existing gaze estimation approaches overlook the rich semantic cues conveyed by linguistic signals and the priors embedded in the CLIP feature space, which limits their performance. To address this gap, we investigate the text-eye collaboration protocol and introduce a novel gaze estimation framework, named GazeCLIP. Specifically, we design a linguistic description generator that produces text signals with coarse directional cues. We also present a CLIP-based backbone that characterizes text-eye pairs for gaze estimation, followed by a fine-grained multi-modal fusion module that models the interrelationships between the heterogeneous inputs. Extensive experiments on three challenging datasets demonstrate the superiority of the proposed GazeCLIP, which achieves state-of-the-art accuracy.
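To make the idea of "text signals with coarse directional cues" concrete, the following is a minimal sketch of how such a description generator might work. It is not the authors' implementation: the direction vocabulary, the prompt template, the function names, and the selection rule (picking the candidate prompt whose text embedding is most similar to the image embedding) are all assumptions introduced for illustration, and the embeddings are plain lists of floats rather than outputs of a real CLIP encoder.

```python
import math

# Hypothetical coarse direction vocabulary and prompt template (assumptions,
# not taken from the abstract).
DIRECTIONS = ("front", "up", "down", "left", "right")
TEMPLATE = "A photo of a face gazing {direction}."


def build_prompts():
    """Return one candidate text description per coarse direction."""
    return [TEMPLATE.format(direction=d) for d in DIRECTIONS]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_prompt(image_embedding, text_embeddings):
    """Pick the prompt whose embedding best matches the image embedding.

    `text_embeddings` must hold one vector per prompt, in DIRECTIONS order.
    In a real system both inputs would come from image and text encoders.
    """
    prompts = build_prompts()
    if len(text_embeddings) != len(prompts):
        raise ValueError("expected one text embedding per candidate prompt")
    scores = [cosine(image_embedding, t) for t in text_embeddings]
    best = max(range(len(prompts)), key=scores.__getitem__)
    return prompts[best]


# Toy usage with stand-in embeddings: each prompt gets a one-hot vector,
# and the image embedding leans toward the "up" axis.
toy_text = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
toy_image = [0.1, 0.9, 0.0, 0.0, 0.0]
selected = pick_prompt(toy_image, toy_text)
```

The selected description would then be passed, together with the face or eye image, to the backbone and fusion module; how those components consume it is not specified in the abstract and is left out of this sketch.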