Guided depth super-resolution (GDSR) is a multi-modal approach to depth map super-resolution that uses a low-resolution depth map together with a high-resolution RGB image to restore fine structural details. However, color and texture cues in the RGB image do not always coincide with true depth discontinuities, and these misleading cues often produce artifacts and blurred depth boundaries in the generated depth map. We address this by introducing global contextual semantic priors generated from the token embeddings of a pretrained vision transformer. Distilling semantic knowledge from pretrained token embeddings is motivated by their demonstrated effectiveness in the related task of monocular depth estimation. We introduce a Guided Token Attention (GTA) module that iteratively aligns encoded RGB spatial features with depth encodings, using cross-attention to selectively inject global semantic context extracted from different layers of a pretrained vision transformer. We also present an architecture, Neural Attention for Implicit Multi-token Alignment (NAIMA), which integrates DINOv2 with GTA blocks for semantics-aware GDSR. By distilling semantic knowledge in this way, the proposed architecture achieves significant improvements over existing methods across multiple scaling factors and datasets.
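To make the cross-attention idea concrete, the following is a minimal, dependency-free sketch of how spatial features could attend to semantic tokens taken from several transformer layers. It is an illustration under stated assumptions, not the GTA module itself: the abstract does not specify learned projections, normalization, the number of iterations, or how RGB and depth encodings are fused, so the sketch uses identity projections and a plain residual update. The function names `cross_attention` and `guided_token_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tokens):
    """Scaled dot-product cross-attention with identity projections.

    queries: list of N spatial feature vectors, each of length d
    tokens:  list of M semantic token vectors, each of length d
    Returns N context vectors, each a convex combination of the tokens.
    Assumption: a real module would apply learned Q/K/V projections.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        # Similarity of this spatial position to every semantic token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Weighted sum of the tokens, one output channel at a time.
        ctx = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)]
        out.append(ctx)
    return out


def guided_token_attention(features, token_layers):
    """Hypothetical iterative injection: one residual update per token layer.

    features:     list of N spatial feature vectors (length d)
    token_layers: list of token sets, one per transformer layer
    """
    for tokens in token_layers:
        context = cross_attention(features, tokens)
        # Residual update keeps the spatial features and adds semantic context.
        features = [
            [f + c for f, c in zip(fv, cv)]
            for fv, cv in zip(features, context)
        ]
    return features
```

Each spatial position acts as a query, so the attention weights decide how much of each semantic token is injected at that position. This is one plausible reading of "selectively injecting" global context.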