Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery

Purpose: Depth estimation in robotic surgery is vital for 3D reconstruction, surgical navigation, and augmented reality visualization. Although foundation models such as DINOv2 exhibit outstanding performance in many vision tasks, including depth estimation, recent works have observed their limitations in medical and surgical domain-specific applications. This work presents a low-rank adaptation (LoRA) of a foundation model for surgical depth estimation. Methods: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of DINOv2 for depth estimation in endoscopic surgery. Instead of conventional fine-tuning, we build LoRA layers and integrate them into DINO to adapt it to surgery-specific domain knowledge. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and optimize only the LoRA layers and the depth decoder to integrate features from the surgical scene. Results: Our model is extensively validated on SCARED, a MICCAI challenge dataset collected during da Vinci Xi endoscopic surgery. We empirically show that Surgical-DINO significantly outperforms all state-of-the-art models in endoscopic depth estimation tasks. Ablation studies provide evidence of the remarkable effect of our LoRA layers and adaptation. Conclusion: Surgical-DINO sheds some light on the successful adaptation of foundation models to the surgical domain for depth estimation. The results give clear evidence that neither zero-shot prediction with weights pre-trained on computer vision datasets nor naive fine-tuning is sufficient to use the foundation model directly in the surgical domain. Code is available at https://github.com/BeileiCui/SurgicalDINO.
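
The low-rank adaptation described in the Methods can be illustrated with a minimal sketch. This is not the Surgical-DINO implementation: the class name `LoRALinear`, the rank, the scaling factor `alpha`, and the layer width of 384 are all assumptions chosen for illustration, and the abstract does not say which encoder projections receive LoRA layers. The sketch shows only the general LoRA idea: a frozen pre-trained weight `W` is kept fixed, and a trainable low-rank product `B @ A` is added to it, with `B` initialized to zero so that the adapted layer initially reproduces the frozen layer.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update (alpha / rank) * B @ A.

    Illustrative sketch only; names and hyperparameters are assumptions.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen, never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x):
        # x has shape (batch, d_in); output has shape (batch, d_out).
        frozen = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return frozen + self.scale * update

    def trainable_params(self):
        # Only A and B would be optimized; the frozen weight is excluded.
        return self.A.size + self.B.size


# Hypothetical layer width of 384, chosen only for the example.
rng = np.random.default_rng(1)
W = rng.normal(size=(384, 384))
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 384))

y_init = layer.forward(x)       # equals the frozen layer output while B is zero
frozen_count = W.size           # parameters that stay fixed
lora_count = layer.trainable_params()
```

With these illustrative sizes, the low-rank matrices hold far fewer parameters than the frozen weight they adapt, which is the motivation for training only the LoRA layers and the depth decoder rather than the whole encoder.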
