Saliency estimation has received growing attention in recent years owing to its importance in a wide range of applications. In the context of 360-degree video, it is particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel transformer-based saliency estimation model for 360-degree videos. Our approach combines an existing encoder, SegFormer, with a custom decoder: SegFormer was originally developed for 2D segmentation tasks, and we fine-tune it to adapt it to 360-degree content. To further improve prediction accuracy, we incorporate a Viewing Center Bias that reflects where users tend to direct their attention in 360-degree environments. Extensive experiments on the three largest 360-degree saliency benchmark datasets demonstrate that SalFormer360 outperforms existing state-of-the-art methods: in terms of Pearson Correlation Coefficient, our model improves on the previous state of the art by 8.4% on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking.
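The abstract does not detail how the Viewing Center Bias is realized. As a minimal sketch, assuming it takes the form of an equirectangular Gaussian prior over latitude that modulates the predicted saliency map, it could look like the following; all function names and parameters here are hypothetical illustrations, not the paper's implementation. The Pearson Correlation Coefficient used as the reported metric is also included.

```python
import numpy as np

def center_bias_prior(height, width, sigma_lat=0.25):
    """Equirectangular center-bias prior: a Gaussian over latitude,
    peaking at the equator and constant over longitude.
    sigma_lat is in normalized latitude units (hypothetical choice)."""
    # Latitude of each row, normalized to [-1, 1] (poles at +/-1).
    lat = np.linspace(-1.0, 1.0, height)
    row_weights = np.exp(-0.5 * (lat / sigma_lat) ** 2)
    return np.tile(row_weights[:, None], (1, width))

def apply_center_bias(saliency, prior, alpha=0.5):
    """Blend a predicted saliency map with the prior. A multiplicative
    blend is shown here; the paper may use an additive or learned
    combination instead."""
    biased = saliency * (alpha + (1.0 - alpha) * prior)
    return biased / (biased.max() + 1e-8)

def pearson_cc(pred, gt):
    """Pearson Correlation Coefficient between a predicted saliency
    map and a ground-truth fixation density map."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((p * g).mean())

# Usage example: a random map stands in for the decoder output.
pred = np.random.rand(256, 512)          # H x W equirectangular frame
prior = center_bias_prior(256, 512)
pred_biased = apply_center_bias(pred, prior)
gt = np.random.rand(256, 512)            # placeholder ground truth
print(pearson_cc(pred_biased, gt))
```

A latitude-only prior reflects the common observation that viewers of equirectangular 360-degree content fixate near the equator far more often than near the poles; whether the bias is fixed or learned end-to-end is a design choice this sketch does not resolve.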