Multi-task learning with cross-task consistency for improved depth estimation in colonoscopy

Colonoscopy screening is the gold standard procedure for assessing abnormalities in the colon and rectum, such as ulcers and cancerous polyps. Measuring the abnormal mucosal area and reconstructing it in 3D can help quantify the surveyed area and objectively evaluate disease burden. However, due to the complex topology of these organs and variable physical conditions (for example, lighting, large homogeneous textures, and image modality), estimating the distance from the camera (i.e., depth) is highly challenging. Moreover, most colonoscopic video acquisition is monocular, making depth estimation a non-trivial problem. While depth estimation methods in computer vision have been proposed and advanced on natural scene datasets, their efficacy has not been widely quantified on colonoscopy datasets. As the colonic mucosa has several low-texture regions that are not well pronounced, learning representations from an auxiliary task can improve salient feature extraction, allowing more accurate estimation of camera depth. In this work, we propose a novel multi-task learning (MTL) approach with a shared encoder and two decoders, namely a surface normal decoder and a depth estimator decoder. Our depth estimator incorporates attention mechanisms to enhance global context awareness. We leverage the surface normal prediction to improve geometric feature extraction. In addition, we apply a cross-task consistency loss between the two geometrically related tasks, surface normal estimation and camera depth estimation. We demonstrate an improvement of 14.17% in relative error and 10.4% in $\delta_{1}$ accuracy over the most accurate state-of-the-art baseline, BTS. All experiments are conducted on the recently released C3VD dataset; thus, we provide a first benchmark of state-of-the-art methods on this dataset.
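For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how relative error and $\delta_{1}$ accuracy are commonly computed in the depth estimation literature. It assumes the usual definitions (mean absolute relative error, and the fraction of pixels whose prediction-to-ground-truth ratio stays below 1.25); the abstract does not state the exact formulas used, so the function names, the threshold value, and the masking of non-positive ground-truth pixels are illustrative assumptions, not the authors' evaluation code.

```python
import math


def abs_rel_error(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth.

    Assumed definition: mean(|pred - gt| / gt).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no pixels with positive ground-truth depth")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    threshold=1.25 is the conventional choice for delta_1 (an assumption here).
    A non-positive prediction is counted as a miss.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no pixels with positive ground-truth depth")
    hits = 0
    for p, g in pairs:
        ratio = max(p / g, g / p) if p > 0 else math.inf
        if ratio < threshold:
            hits += 1
    return hits / len(pairs)


if __name__ == "__main__":
    # Hypothetical flattened depth maps (arbitrary units), for illustration only.
    predicted = [1.0, 2.0, 4.0, 5.0]
    ground_truth = [1.0, 2.2, 4.0, 3.0]
    print("AbsRel :", abs_rel_error(predicted, ground_truth))
    print("delta_1:", delta_accuracy(predicted, ground_truth))
```

Lower relative error and higher $\delta_{1}$ accuracy both indicate better depth predictions, which is why the abstract reports gains on both.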
