Saliency prediction aims to predict the distribution of human visual attention over a given RGB image. Most recent state-of-the-art methods rely on deep image feature representations from conventional CNNs. However, conventional convolution cannot capture the global features of an image well because of its small kernel size. Moreover, high-level factors that correlate closely with human visual perception, e.g., objects, color, and light, are not considered. Motivated by these limitations, we propose a Transformer-based method with semantic segmentation as an additional learning objective. The Transformer captures more global cues from the image. In addition, simultaneously learning object segmentation simulates human visual perception, which we verify in our investigation of human gaze control in cognitive science. We build an extra decoder for the subtask, and the tasks share the same Transformer encoder, forcing it to learn from multiple feature spaces. In practice, we find that simply adding the subtask can confuse learning of the main task; hence, we propose a Multi-task Attention Module to handle the feature interaction between the learning targets. Our method achieves competitive performance compared with other state-of-the-art methods.
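
The shared-encoder, two-decoder setup described above can be written schematically as follows. This is only a sketch under stated assumptions: the symbols $E_\theta$, $D_{\mathrm{sal}}$, $D_{\mathrm{seg}}$, the ground-truth maps $S$ and $M$, and the weight $\lambda$ are notation introduced here for illustration, and the weighted-sum form of the objective is a common multi-task choice rather than something the abstract states. The Multi-task Attention Module is left out because the abstract does not specify its form.

```latex
% Hypothetical notation, not taken from the paper:
%   I        input RGB image
%   E_\theta shared Transformer encoder
%   D_sal    saliency decoder (main task)
%   D_seg    segmentation decoder (subtask)
%   S, M     ground-truth saliency map and segmentation mask
\begin{align}
  F &= E_\theta(I), \\
  \hat{S} &= D_{\mathrm{sal}}(F), \qquad \hat{M} = D_{\mathrm{seg}}(F), \\
  \mathcal{L} &= \mathcal{L}_{\mathrm{sal}}(\hat{S}, S)
      + \lambda \, \mathcal{L}_{\mathrm{seg}}(\hat{M}, M), \qquad \lambda \ge 0 .
\end{align}
```

Because both loss terms backpropagate through the same $E_\theta$, the encoder receives gradients from both tasks, which is one way to read the statement that it is forced to learn from multiple feature spaces. Setting $\lambda = 0$ in this sketch recovers a single-task saliency model.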