We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning both about the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: http://github.com/fkryan/gazelle.
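The core mechanism the abstract describes, one frozen scene feature map plus a person-specific positional prompt fed to a lightweight decoder, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the shapes, the random stand-in for DINOv2 features, and the linear-projection "decoder" are all hypothetical placeholders for illustration.

```python
import numpy as np

# Hypothetical sketch of the prompting idea (shapes and names are
# assumptions, not the paper's exact design): a frozen encoder yields
# one scene feature map; a learned position embedding is added only at
# the patches covered by the target person's head; a lightweight
# decoder turns the prompted features into a gaze heatmap.

rng = np.random.default_rng(0)

H, W, D = 16, 16, 32                        # patch grid and feature dim
scene_feats = rng.normal(size=(H, W, D))    # stand-in for frozen DINOv2 features
head_prompt = rng.normal(size=(D,))         # learned person-position embedding

def add_head_prompt(feats, bbox, prompt):
    """Add the prompt embedding to patches inside the head bbox (y0, x0, y1, x1)."""
    out = feats.copy()
    y0, x0, y1, x1 = bbox
    out[y0:y1, x0:x1] += prompt
    return out

def decode_heatmap(feats, w):
    """Toy 'lightweight decoder': linear projection + softmax over all patches."""
    logits = feats @ w                       # (H, W) score per patch
    e = np.exp(logits - logits.max())        # stable softmax
    return e / e.sum()

w = rng.normal(size=(D,))                    # toy decoder weights
prompted = add_head_prompt(scene_feats, (2, 3, 5, 6), head_prompt)
heatmap = decode_heatmap(prompted, w)
print(heatmap.shape, np.isclose(heatmap.sum(), 1.0))  # (16, 16) True
```

Because the prompt changes only the head region, the same scene features can be reused to decode gaze for multiple people, which is what lets the heavy encoder stay frozen and shared.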