Video saliency prediction and detection are thriving research domains that enable computers to simulate the distribution of visual attention, akin to how humans perceive dynamic scenes. While many approaches have crafted task-specific training paradigms for either video saliency prediction or video salient object detection, little attention has been devoted to devising a generalized saliency modeling framework that seamlessly bridges these two distinct tasks. In this study, we introduce the Unified Saliency Transformer (UniST) framework, which comprehensively utilizes the essential attributes of video saliency prediction and video salient object detection. In addition to extracting representations of frame sequences, a saliency-aware transformer is designed to learn spatio-temporal representations at progressively increasing resolutions, while incorporating effective cross-scale saliency information to produce a robust representation. Furthermore, a task-specific decoder is proposed to perform the final prediction for each task. To the best of our knowledge, this is the first work that explores designing a transformer structure for both saliency modeling tasks. Experimental results convincingly demonstrate that the proposed UniST achieves superior performance across seven challenging benchmarks spanning the two tasks and significantly outperforms other state-of-the-art methods.