VST++: Efficient and Stronger Visual Saliency Transformer

Although CNN-based models have achieved promising results for salient object detection (SOD), their ability to capture global long-range dependencies is limited. Our previous work, the Visual Saliency Transformer (VST), addressed this limitation from a transformer-based sequence-to-sequence perspective and unified RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that jointly predicts saliency and boundaries within a pure transformer architecture. We also introduced a novel token upsampling method, reverse T2T, which makes it straightforward to predict high-resolution saliency maps in transformer-based structures. Building on VST, this work proposes a more efficient and stronger version, VST++. To reduce the computational cost of VST, we propose a Select-Integrate Attention (SIA) module that partitions the foreground into fine-grained segments and aggregates background information into a single coarse-grained token. To incorporate 3D depth information at low cost, we design a novel depth position encoding method tailored to depth maps. Furthermore, we introduce a token-supervised prediction loss that provides direct guidance for the task-related tokens. We evaluate VST++ with various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while reducing computational cost by 25% without a significant loss in performance. Its strong generalization ability, improved performance, and higher efficiency highlight the potential of VST++.
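The select-and-integrate idea behind SIA can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a boolean foreground mask is already available, that background tokens are aggregated by simple averaging, and that attention is single-head without learned projections. The function names `select_integrate` and `attend` are hypothetical.

```python
import numpy as np

def select_integrate(tokens, fg_mask):
    """Keep foreground tokens as-is and merge all background tokens into one.

    tokens:  (N, C) array of patch tokens.
    fg_mask: (N,) boolean array, True for foreground tokens (assumed given).
    Returns an array of shape (K + 1, C), where K is the foreground count,
    or (K, C) if there are no background tokens.
    """
    fg = tokens[fg_mask]
    bg = tokens[~fg_mask]
    if bg.shape[0] == 0:
        return fg
    # Assumption: mean pooling stands in for the background aggregation.
    bg_token = bg.mean(axis=0, keepdims=True)
    return np.concatenate([fg, bg_token], axis=0)

def attend(queries, context):
    """Plain scaled dot-product attention of queries over a reduced context."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((16, 8))   # 16 patch tokens, 8 channels
    fg_mask = np.zeros(16, dtype=bool)
    fg_mask[:4] = True                      # pretend 4 tokens are foreground
    context = select_integrate(tokens, fg_mask)
    queries = rng.standard_normal((3, 8))
    out = attend(queries, context)
    print(context.shape, out.shape)
```

In this toy setting each query attends over 5 tokens instead of 16, which is the kind of saving the abstract attributes to SIA. The actual selection and aggregation rules are defined in the paper body, not here.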
