Investigating a Policy-Based Formulation for Endoscopic Camera Pose Recovery

Jan Emily Mangulabnan, Akshat Chauhan, Laura Fleig, Lalithkumar Seenivasan, Roger D. Soberanis-Mukul, S. Swaroop Vedula, Russell H. Taylor, Masaru Ishii, Gregory D. Hager, Mathias Unberath

In endoscopic surgery, surgeons continuously locate the endoscopic view relative to the anatomy by interpreting the evolving visual appearance of the intraoperative scene in the context of their prior knowledge. Vision-based navigation systems seek to replicate this capability by recovering camera pose directly from endoscopic video, but most approaches do not embody the principles of reasoning about new frames that make surgeons successful. Instead, they remain grounded in feature matching and geometric optimization over keyframes, an approach that has been shown to degrade under the challenging conditions of endoscopic imaging, such as low texture and rapid illumination changes. Here, we pursue an alternative and investigate a policy-based formulation of endoscopic camera pose recovery that seeks to imitate experts in estimating trajectories conditioned on the previous camera state. Our approach directly predicts short-horizon relative motions without maintaining an explicit geometric representation at inference time. It thus addresses, by design, some of the notorious challenges of geometry-based approaches, such as brittle correspondence matching, instability in texture-sparse regions, and limited pose coverage due to reconstruction failure. We evaluate the proposed formulation on cadaveric sinus endoscopy. Under oracle state conditioning, we compare short-horizon motion prediction quality against geometric baselines; our formulation achieves the lowest mean translation error and competitive rotational accuracy. We analyze robustness by grouping prediction windows according to texture richness and illumination change; the results indicate reduced sensitivity to low-texture conditions. These findings suggest that a learned motion policy offers a viable alternative formulation for endoscopic camera pose recovery.
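To make the idea of short-horizon relative motion prediction concrete, the sketch below shows how a sequence of predicted relative motions could be chained onto a known previous camera pose, and how translation and rotation errors might be computed against a reference pose. It is an illustration only and not the authors' implementation: the function names (`make_pose`, `roll_out`, `translation_error`, `rotation_error_deg`), the 4x4 homogeneous pose convention, and the error definitions (Euclidean distance and geodesic rotation angle) are all assumptions chosen because they are common in pose estimation. The abstract does not specify which metrics or conventions were used.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_pose(R, t):
    """Build a 4x4 homogeneous pose from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def roll_out(T_start, relative_motions):
    """Chain relative motions onto a known starting pose.

    Assumption: each relative motion is expressed in the previous camera
    frame, so it is applied by right-multiplication.
    """
    poses = [T_start]
    for T_rel in relative_motions:
        poses.append(poses[-1] @ T_rel)
    return poses

def translation_error(T_pred, T_true):
    """Euclidean distance between predicted and reference camera positions."""
    return float(np.linalg.norm(T_pred[:3, 3] - T_true[:3, 3]))

def rotation_error_deg(T_pred, T_true):
    """Geodesic angle (degrees) between predicted and reference rotations."""
    R_delta = T_pred[:3, :3].T @ T_true[:3, :3]
    cos_angle = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

# Hypothetical three-step window: the "reference" motion rotates 10 degrees
# and advances 1 unit per step; the "predicted" motion slightly
# underestimates both. These numbers are made up for illustration.
T_start = np.eye(4)  # previous camera state, assumed known
true_steps = [make_pose(rot_z(np.radians(10.0)), [1.0, 0.0, 0.0])] * 3
pred_steps = [make_pose(rot_z(np.radians(9.0)), [0.9, 0.0, 0.0])] * 3

true_poses = roll_out(T_start, true_steps)
pred_poses = roll_out(T_start, pred_steps)

print("translation error:", translation_error(pred_poses[-1], true_poses[-1]))
print("rotation error (deg):", rotation_error_deg(pred_poses[-1], true_poses[-1]))
```

Because each window starts from a known previous state in this sketch, errors do not accumulate across windows. This mirrors evaluation under oracle state conditioning only in spirit.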
