In this paper, we address the task of text-driven saliency detection in 360-degree videos. To this end, we introduce the TSV360 dataset, which contains 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. Then, we extend and adapt a SOTA visual-based approach for 360-degree video saliency detection and develop the TSalV360 method, which takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the different data modalities. Quantitative and qualitative evaluations on the TSV360 dataset showed that TSalV360 is competitive with a SOTA visual-based approach and demonstrated its ability to perform customized text-driven saliency detection in 360-degree videos.
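To give a concrete sense of the similarity-estimation idea described above, the following is a minimal sketch, not the paper's actual module: it assumes a vision-language model has already produced one text embedding and one visual embedding per viewport, and scores each viewport by cosine similarity with the text query. The function name `cosine_similarity_map` and the toy embeddings are illustrative placeholders.

```python
import numpy as np

def cosine_similarity_map(text_emb, viewport_embs):
    """Score each viewport embedding against a single text embedding
    via cosine similarity (both inputs assumed to come from a VLM)."""
    t = text_emb / np.linalg.norm(text_emb)
    v = viewport_embs / np.linalg.norm(viewport_embs, axis=-1, keepdims=True)
    return v @ t  # one similarity score per viewport

# Toy example: 3 viewports with 4-dim embeddings (placeholders for VLM outputs).
text = np.array([1.0, 0.0, 0.0, 0.0])
viewports = np.array([
    [1.0, 0.0, 0.0, 0.0],  # aligned with the text query
    [0.0, 1.0, 0.0, 0.0],  # orthogonal to the text query
    [0.5, 0.5, 0.0, 0.0],  # partially aligned
])
sims = cosine_similarity_map(text, viewports)
```

In a full pipeline, such per-viewport scores would serve as a text-conditioned prior that downstream components (e.g. a cross-attention mechanism) refine into a saliency map.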