Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision

Monocular depth and pose estimation play an important role in the development of colonoscopy-assisted navigation: they improve screening by reducing blind spots, minimizing the risk of missed or recurrent lesions, and lowering the likelihood of incomplete examinations. The task remains challenging, however, because of texture-less surfaces, complex illumination patterns, tissue deformation, and a lack of in vivo datasets with reliable ground truth. In this paper, we propose **PRISM** (Pose-Refinement with Intrinsic Shading and edge Maps), a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning. Our approach uniquely combines edge detection and luminance decoupling for structural guidance. Specifically, edge maps are produced by a learning-based edge detector (e.g., DexiNed or HED) trained to capture thin, high-frequency boundaries, while luminance decoupling is obtained through an intrinsic decomposition module that separates shading from reflectance, enabling the model to exploit shading cues for depth estimation. Experimental results on multiple real and synthetic datasets demonstrate state-of-the-art performance. We further conduct a thorough ablation study on training data selection to establish best practices for pose and depth estimation in colonoscopy. This analysis yields two practical insights: (1) self-supervised training on real-world data outperforms supervised training on realistic phantom data, underscoring that domain realism matters more than ground-truth availability; and (2) video frame rate strongly affects model performance, so frame sampling must be chosen per dataset to generate high-quality training data.
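To make the second insight concrete, the following is a minimal sketch of dataset-specific frame sampling for self-supervised training. It is an illustration under stated assumptions, not the procedure used in the paper: the helper names `frame_stride` and `sample_triplets`, the idea of a common target frame rate, and the (previous, target, next) triplet layout are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def frame_stride(native_fps: float, target_fps: float) -> int:
    """Frames to skip so the effective rate is close to target_fps.

    Assumption: every dataset is resampled toward one common target rate.
    The abstract only states that sampling must be dataset-specific.
    """
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Never go below a stride of 1 (cannot sample faster than the video).
    return max(1, round(native_fps / target_fps))


def sample_triplets(num_frames: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (previous, target, next) frame indices spaced `stride` apart.

    Assumption: training uses a target frame with two source frames,
    a common layout in self-supervised depth and pose pipelines.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return [(t - stride, t, t + stride)
            for t in range(stride, num_frames - stride)]


if __name__ == "__main__":
    # Hypothetical example: a 30 fps video resampled toward 10 fps.
    stride = frame_stride(native_fps=30.0, target_fps=10.0)
    print(stride)                          # 3
    print(sample_triplets(10, stride)[0])  # (0, 3, 6)
```

A larger stride increases the camera motion between the frames of a triplet, which is the quantity the frame rate effectively controls. A suitable target rate would have to be found empirically for each dataset.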
