Audio-visual saliency prediction can benefit from complementary modalities, but further performance gains are still hindered by the reliance on customized architectures and task-specific loss functions. Recent studies show that denoising diffusion models hold greater promise for unifying task frameworks, owing to their inherent generalization ability. Following this motivation, this work proposes a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal), which formulates the prediction problem as a conditional generative task: the saliency map is generated with the input audio and video as conditions. On top of the spatio-temporal audio-visual features, an additional network, Saliency-UNet, is designed to perform multi-modal attention modulation, progressively refining a noisy map toward the ground-truth saliency map. Extensive experiments demonstrate that the proposed DiffSal achieves excellent performance on six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results across six metrics.
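The conditional generative formulation can be illustrated with a minimal sketch, given here only to make the idea of refining a noisy map under audio-visual conditions concrete. It is not the DiffSal implementation: the cosine noise schedule, the deterministic DDIM-style update, and the names `cosine_alpha_bar`, `add_noise`, `toy_denoiser`, and `sample_saliency` are all assumptions made for illustration, and `toy_denoiser` is a hypothetical stand-in for a learned network such as Saliency-UNet. Saliency maps and condition features are flattened to plain lists of floats.

```python
import math
import random


def cosine_alpha_bar(t, T):
    """Cumulative signal level at step t (cosine schedule; an assumed choice)."""
    s = 0.008

    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2

    return f(t) / f(0)


def add_noise(x0, t, T, rng):
    """Forward process: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = cosine_alpha_bar(t, T)
    return [math.sqrt(ab) * v + math.sqrt(1 - ab) * rng.gauss(0, 1) for v in x0]


def toy_denoiser(x_t, t, cond):
    """Hypothetical stand-in for a learned conditional denoiser.

    A real network would use the noisy map x_t, the step t, and the
    audio-visual features. This toy version reads the clean-map estimate
    from the condition alone, so the sketch stays deterministic.
    """
    return [1 / (1 + math.exp(-c)) for c in cond]


def sample_saliency(cond, T, rng, denoiser=toy_denoiser):
    """Reverse process: start from pure noise and refine it step by step."""
    x = [rng.gauss(0, 1) for _ in cond]
    for t in range(T, 0, -1):
        ab_t = cosine_alpha_bar(t, T)
        ab_prev = cosine_alpha_bar(t - 1, T)
        x0_hat = denoiser(x, t, cond)
        # Noise implied by the current map and the clean-map estimate.
        eps_hat = [
            (xi - math.sqrt(ab_t) * hi) / math.sqrt(1 - ab_t)
            for xi, hi in zip(x, x0_hat)
        ]
        # Deterministic (DDIM-style) move to the previous, less noisy step.
        x = [
            math.sqrt(ab_prev) * hi + math.sqrt(1 - ab_prev) * ei
            for hi, ei in zip(x0_hat, eps_hat)
        ]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    features = [-2.0, 0.0, 3.0]  # hypothetical per-pixel condition features
    print(sample_saliency(features, T=10, rng=rng))
```

In training, `add_noise` would corrupt the ground-truth map at a random step and the denoiser would be fit to recover it; at inference only `sample_saliency` is needed, with the audio and video features supplied as `cond`.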