Monocular depth estimation in endoscopy videos can help assistive and robotic surgery achieve better coverage of the organ and better detection of various health issues. Despite promising progress in mainstream natural-image depth estimation, existing techniques perform poorly on endoscopy images due to a lack of strong geometric features and challenging illumination effects. In this paper, we utilize photometric cues, i.e., the light emitted from an endoscope and reflected by the surface, to improve monocular depth estimation. We first create two novel loss functions, with supervised and self-supervised variants, that utilize a per-pixel shading representation. We then propose a novel depth refinement network (PPSNet) that leverages the same per-pixel shading representation. Finally, we introduce teacher-student transfer learning to produce better depth maps from both synthetic data with supervision and clinical data with self-supervision. We achieve state-of-the-art results on the C3VD dataset while estimating high-quality depth maps from clinical data. Our code, pre-trained models, and supplementary materials can be found on our project page: https://ppsnet.github.io/