Multi-modal fusion is vital to the success of depth image super-resolution. However, commonly used fusion strategies, such as addition and concatenation, fall short of effectively bridging the modal gap, and guided image filtering methods have therefore been introduced to mitigate this issue. Nevertheless, their filter kernels usually suffer from significant texture interference and edge inaccuracy. To tackle these two challenges, we introduce a Scene Prior Filtering network, SPFNet, which utilizes the surface normal and semantic priors generated by large-scale models. Specifically, we design an All-in-one Prior Propagation that computes the similarity between multi-modal scene priors, \textit{i.e.}, RGB, normal, semantic, and depth, to reduce the texture interference. In addition, we present a One-to-one Prior Embedding that continuously embeds each single-modal prior into depth using Mutual Guided Filtering, further alleviating texture interference while enhancing edges. Our SPFNet has been extensively evaluated on both real and synthetic datasets, achieving state-of-the-art performance.
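To make the similarity-based propagation concrete, below is a minimal sketch of similarity-weighted prior fusion, assuming all priors (RGB, normal, semantic) have already been encoded into feature maps of a common shape. The function name `prior_propagation` and the clamped cosine weighting are illustrative assumptions, not the network's exact formulation.

```python
import torch
import torch.nn.functional as F

def prior_propagation(depth_feat, prior_feats):
    """Sketch of similarity-weighted prior fusion.

    depth_feat:  (B, C, H, W) depth feature map.
    prior_feats: list of (B, C, H, W) prior feature maps
                 (e.g., RGB, surface normal, semantic).

    Each prior is re-weighted by its per-pixel cosine similarity to the
    depth feature, so regions whose textures disagree with the depth
    structure contribute less, reducing texture interference.
    """
    d = F.normalize(depth_feat, dim=1)
    fused = depth_feat
    for p in prior_feats:
        # Per-pixel cosine similarity in [-1, 1], shape (B, 1, H, W).
        sim = (d * F.normalize(p, dim=1)).sum(dim=1, keepdim=True)
        # Suppress regions where the prior conflicts with depth structure.
        weight = sim.clamp(min=0)
        fused = fused + weight * p
    return fused
```

In this sketch the weighting is hard-coded; a learned network could equally produce the per-prior weights from the stacked similarity maps.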
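For the One-to-one Prior Embedding, the following is a hedged sketch of mutual guidance built on the classic guided image filter (He et al.), in which each modality filters the other: the prior's edges sharpen depth, while depth structure suppresses the prior's texture. The helper names and the radius/epsilon defaults are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def box_filter(x, r):
    # Mean filter over a (2r+1) x (2r+1) window via average pooling.
    return F.avg_pool2d(x, kernel_size=2 * r + 1, stride=1,
                        padding=r, count_include_pad=False)

def guided_filter(guide, src, r=4, eps=1e-4):
    """Classic guided image filter: smooths `src` while preserving the
    edges of `guide`. Both inputs are (B, 1, H, W) tensors."""
    mean_g = box_filter(guide, r)
    mean_s = box_filter(src, r)
    cov_gs = box_filter(guide * src, r) - mean_g * mean_s
    var_g = box_filter(guide * guide, r) - mean_g * mean_g
    a = cov_gs / (var_g + eps)
    b = mean_s - a * mean_g
    return box_filter(a, r) * guide + box_filter(b, r)

def mutual_guided_filter(depth, prior, r=4, eps=1e-4):
    """Sketch of mutual guidance between depth and one scene prior:
    the prior's edges refine depth, and depth structure cleans the
    prior's texture. Returns both filtered signals."""
    depth_refined = guided_filter(prior, depth, r, eps)
    prior_cleaned = guided_filter(depth, prior, r, eps)
    return depth_refined, prior_cleaned
```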