We present a novel approach for saliency prediction in images, leveraging parallel decoding in transformers to learn saliency solely from fixation maps. Existing models typically rely on continuous saliency maps to overcome the difficulty of optimizing against the discrete fixation map. We instead attempt to replicate the experimental setup that generates saliency datasets. Our approach treats saliency prediction as a direct set prediction problem, combining a transformer encoder-decoder architecture with a global loss that enforces unique fixation predictions through bipartite matching. Using a fixed set of learned fixation queries, the cross-attention reasons over the image features to output the fixation points directly, which distinguishes our approach from other modern saliency predictors. Our approach, named Saliency TRansformer (SalTR), achieves metric scores on par with state-of-the-art approaches on the Salicon and MIT300 benchmarks.
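To make the set prediction idea concrete, the following is a minimal sketch of one-to-one matching between predicted and ground-truth fixation points. It is an illustration under stated assumptions, not the SalTR implementation: the function name `match_fixations`, the use of plain Euclidean distance as the matching cost, and the brute-force search are all choices made here for readability. A practical implementation would more likely use the Hungarian algorithm and a richer cost.

```python
import itertools
import math


def match_fixations(predicted, ground_truth):
    """Find the one-to-one assignment of ground-truth fixations to
    predicted fixations that minimizes the total Euclidean distance.

    Hypothetical helper for illustration only. It enumerates every
    assignment, so it is usable only for very small sets.
    Returns (pairs, cost), where pairs is a list of
    (predicted_index, ground_truth_index) tuples.
    """
    if len(ground_truth) > len(predicted):
        raise ValueError("need at least as many predictions as targets")

    best_cost = math.inf
    best_pairs = []
    # Each candidate picks a distinct prediction for every target, so no
    # prediction can be matched to two ground-truth fixations.
    for perm in itertools.permutations(range(len(predicted)), len(ground_truth)):
        cost = sum(
            math.dist(predicted[p], ground_truth[g]) for g, p in enumerate(perm)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs, best_cost


# Toy example with normalized (x, y) coordinates (made-up values).
predicted = [(0.9, 0.9), (0.1, 0.2), (0.5, 0.5)]
ground_truth = [(0.1, 0.1), (1.0, 1.0)]

pairs, cost = match_fixations(predicted, ground_truth)
print(pairs, round(cost, 4))
```

Because every ground-truth fixation is assigned to a different prediction, two predictions that land on the same target cannot both be rewarded, which is the sense in which a matching-based loss encourages unique fixation predictions. Predictions left unmatched (index 2 in the toy example) would be treated as "no fixation" in a full set prediction loss.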