Contextual cues related to a person's pose and interactions with objects and other people in the scene can provide valuable information for gaze following. While existing methods have focused on dedicated cue extraction techniques, in this work we investigate the zero-shot capabilities of Vision-Language Models (VLMs) for extracting a wide array of contextual cues to improve gaze following performance. We first evaluate the zero-shot cue recognition performance of various VLMs, prompting strategies, and in-context learning (ICL) techniques. We then use these insights to extract contextual cues for gaze following and investigate their impact when incorporated into a state-of-the-art model for the task. Our analysis indicates that BLIP-2 is the overall best-performing VLM and that ICL can improve performance. We also observe that VLMs are sensitive to the choice of text prompt, although ensembling over multiple text prompts provides more robust performance. Additionally, we find that using the entire image together with an ellipse drawn around the target person is the most effective visual prompting strategy. For gaze following, incorporating the extracted cues results in better generalization performance, especially when considering a larger set of cues, highlighting the potential of this approach.
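A minimal sketch of the two prompting ideas above, visual prompting with an ellipse drawn around the target person on the full image, and ensembling over multiple text prompts, is given below. It assumes the Hugging Face transformers BLIP-2 checkpoint Salesforce/blip2-opt-2.7b; the prompt wordings, ellipse styling, and majority-vote rule are illustrative assumptions rather than the paper's exact protocol.

```python
# Illustrative sketch (not the paper's exact setup): zero-shot cue recognition
# with BLIP-2, combining an ellipse visual prompt with a majority vote over an
# ensemble of paraphrased text prompts.
import torch
from PIL import Image, ImageDraw
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

def draw_person_ellipse(image: Image.Image, bbox, color="red", width=4):
    """Return a copy of the full image with an ellipse inscribed in the target
    person's bounding box (x1, y1, x2, y2), keeping the whole scene as context."""
    marked = image.copy()
    ImageDraw.Draw(marked).ellipse(bbox, outline=color, width=width)
    return marked

# Hypothetical paraphrases of one binary cue question (object interaction).
PROMPTS = [
    "Question: Is the person inside the red ellipse touching an object? Answer:",
    "Question: Is the person marked by the red ellipse interacting with an object? Answer:",
    "Question: Does the highlighted person hold or touch any object? Answer:",
]

def recognize_cue(image: Image.Image, person_bbox) -> bool:
    """Majority vote of yes/no answers over the prompt ensemble."""
    marked = draw_person_ellipse(image, person_bbox)
    votes = []
    for prompt in PROMPTS:
        inputs = processor(images=marked, text=prompt, return_tensors="pt").to(
            device, dtype
        )
        out_ids = model.generate(**inputs, max_new_tokens=5)
        answer = processor.batch_decode(out_ids, skip_special_tokens=True)[0]
        votes.append(answer.strip().lower().startswith("yes"))
    return sum(votes) > len(PROMPTS) / 2
```

Under this sketch, ICL examples could be prepended to each prompt, and the per-cue binary answers would then serve as additional inputs to the gaze following model.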