Video saliency prediction aims to identify the regions in a video that attract human attention and gaze, driven by bottom-up features from the video and top-down processes such as memory and cognition. Among these top-down influences, language plays a crucial role in guiding attention by shaping how visual information is interpreted. Existing methods primarily focus on modeling perceptual information while neglecting the reasoning process facilitated by language, a process whose ranking cues are crucial outcomes and offer practical guidance for saliency prediction. In this paper, we propose CaRDiff (Caption, Rank, and generate with Diffusion), a framework that imitates this process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model to enhance video saliency prediction. Specifically, we introduce a novel prompting method, VSOR-CoT (Video Salient Object Ranking Chain of Thought), which utilizes an MLLM with a grounding module to caption video content and infer salient objects along with their rankings and positions. The inferred rankings yield ranking maps that the diffusion model can leverage to accurately decode the saliency maps for the given video. Extensive experiments demonstrate the effectiveness of VSOR-CoT in improving video saliency prediction. The proposed CaRDiff outperforms state-of-the-art models on the MVS dataset and demonstrates cross-dataset generalization on the DHF1k dataset through zero-shot evaluation.
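To make the role of the ranking maps concrete, the following is a minimal sketch of how VSOR-CoT outputs (ranked salient objects with bounding boxes) could be rasterized into a 2D ranking map before being passed to the diffusion decoder. The tuple format, the linear rank-to-weight decay, and the function name `ranking_map_from_vsor_cot` are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def ranking_map_from_vsor_cot(detections, height, width):
    """Rasterize ranked salient-object boxes into a 2D ranking map.

    `detections` is a list of (rank, (x1, y1, x2, y2)) tuples, where rank 1
    is the most salient object and boxes are in pixel coordinates.
    Higher-ranked objects receive larger weights; overlapping regions keep
    the maximum weight. (Illustrative weighting scheme, not from the paper.)
    """
    rank_map = np.zeros((height, width), dtype=np.float32)
    if not detections:
        return rank_map
    n = max(rank for rank, _ in detections)
    for rank, (x1, y1, x2, y2) in detections:
        # Linearly decay the weight with rank: rank 1 -> 1.0, rank n -> 1/n.
        weight = (n - rank + 1) / n
        x1, x2 = max(0, int(x1)), min(width, int(x2))
        y1, y2 = max(0, int(y1)), min(height, int(y2))
        rank_map[y1:y2, x1:x2] = np.maximum(rank_map[y1:y2, x1:x2], weight)
    return rank_map

# Example: two objects inferred by VSOR-CoT, e.g. a person (rank 1) and a ball (rank 2).
detections = [(1, (40, 30, 120, 100)), (2, (150, 60, 200, 110))]
rmap = ranking_map_from_vsor_cot(detections, height=180, width=320)
print(rmap.shape, rmap.max(), rmap.min())
```

In the actual framework, such a map would serve as the conditioning signal that the diffusion model uses, together with the video features, to decode the final saliency map.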