Multi-modal fusion is vital to depth map super-resolution. However, common fusion strategies such as addition and concatenation fall short of bridging the modality gap. Guided image filtering methods have been introduced to mitigate this issue, yet their filter kernels usually suffer from significant texture interference and edge inaccuracy. To tackle these two challenges, we introduce a Scene Prior Filtering network, SPFNet, which uses surface normal and semantic map priors obtained from large-scale models. Specifically, we design an All-in-one Prior Propagation that computes the similarity between multi-modal scene priors, i.e., RGB, normal, semantic, and depth, to reduce texture interference. In addition, we present a One-to-one Prior Embedding that continuously embeds each single-modal prior into depth using Mutual Guided Filtering, further alleviating texture interference while enhancing edges. SPFNet has been extensively evaluated on both real and synthetic datasets, achieving state-of-the-art performance.
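For readers unfamiliar with guided image filtering, the following is a minimal sketch of the classic single-channel guided filter (in the style of He et al.), written in NumPy. It is background only and is not SPFNet's Mutual Guided Filtering, whose details the abstract does not give. The names `box_mean` and `guided_filter`, and the parameters `r` (window radius) and `eps` (regularizer), are illustrative assumptions.

```python
import numpy as np


def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, using edge-replicated padding."""
    k = 2 * r + 1
    h, w = x.shape
    p = np.pad(x.astype(np.float64), r, mode="edge")
    # Integral image with a leading row and column of zeros.
    s = np.zeros((p.shape[0] + 1, p.shape[1] + 1), dtype=np.float64)
    s[1:, 1:] = p.cumsum(axis=0).cumsum(axis=1)
    total = s[k:k + h, k:k + w] - s[:h, k:k + w] - s[k:k + h, :w] + s[:h, :w]
    return total / (k * k)


def guided_filter(guide, src, r=2, eps=1e-3):
    """Filter `src` (e.g. depth) with a local linear model of `guide` (e.g. gray RGB).

    Within each window the output is modeled as a * guide + b, so edges in the
    guide are transferred to the output. Texture in the guide is transferred
    too, which is the interference problem the abstract refers to.
    """
    guide = guide.astype(np.float64)
    src = src.astype(np.float64)
    mean_g = box_mean(guide, r)
    mean_s = box_mean(src, r)
    cov_gs = box_mean(guide * src, r) - mean_g * mean_s
    var_g = box_mean(guide * guide, r) - mean_g * mean_g
    a = cov_gs / (var_g + eps)   # per-window slope
    b = mean_s - a * mean_g      # per-window offset
    # Average the coefficients of all windows covering each pixel.
    return box_mean(a, r) * guide + box_mean(b, r)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    guide = rng.random((16, 16))       # stand-in for a grayscale guidance image
    depth = np.full((16, 16), 0.5)     # stand-in for an upsampled depth map
    print(guided_filter(guide, depth).shape)
```

A larger `eps` pushes the slope `a` toward zero, so the output approaches a plain box-filtered `src`; a smaller `eps` lets more of the guide's structure, including unwanted texture, through.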