Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.
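The selective adapter placement described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not state which layers receive adapters, the adapter rank, or the form of the layer-wise objectives, so the 12-layer encoder, the layer indices, and the names `LoRALinear` and `plan_adapters` are hypothetical. The low-rank update itself follows the standard LoRA form, a frozen weight W plus a scaled product of two small matrices.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, dim, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # W stands in for a pretrained weight that stays frozen.
        self.W = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(dim)]
        # A (rank x dim) is randomly initialised; B (dim x rank) starts at zero,
        # so the adapted layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(dim)]
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


def plan_adapters(num_layers, geometric_layers, semantic_layers):
    """Map each layer index to its objective, or None if it gets no adapter."""
    plan = {}
    for i in range(num_layers):
        if i in geometric_layers:
            plan[i] = "geometric"  # intermediate features: correspondence objective
        elif i in semantic_layers:
            plan[i] = "semantic"   # deeper features: consistency objective
        else:
            plan[i] = None         # layer stays frozen, no adapter inserted
    return plan


# Hypothetical 12-layer encoder; the layer choices are illustrative only.
plan = plan_adapters(12, geometric_layers={4, 5, 6, 7}, semantic_layers={10, 11})
adapters = {i: LoRALinear(dim=8, rank=2, seed=i) for i, obj in plan.items() if obj}
```

In this sketch only six of the twelve layers carry trainable parameters, and each adapted layer is tagged with the objective that would supervise its features. A practical implementation would attach the adapters to attention or MLP projections in a deep learning framework and define the actual loss terms, neither of which is specified in the abstract.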