Cardiac catheterization remains a cornerstone of minimally invasive interventions, yet it continues to rely heavily on manual operation. Despite advances in robotic platforms, existing systems are predominantly follow-leader in nature: they require continuous physician input and lack intelligent autonomy. This dependency contributes to operator fatigue, increased radiation exposure, and variability in procedural outcomes. This work moves toward autonomous catheter navigation by introducing DINO-CVA, a multimodal goal-conditioned behavior cloning framework. The proposed model fuses visual observations and joystick kinematics into a joint embedding space, enabling policies that are both vision-aware and kinematics-aware. Actions are predicted autoregressively by a policy learned from expert demonstrations, with goal conditioning guiding navigation toward specified destinations. A robotic experimental setup with a synthetic vascular phantom was designed to collect multimodal datasets and evaluate performance. Results show that DINO-CVA predicts actions with high accuracy, matching the performance of a kinematics-only baseline while additionally grounding its predictions in the anatomical environment. These findings establish the feasibility of multimodal, goal-conditioned architectures for catheter navigation and represent an important step toward reducing operator dependency and improving the reliability of catheter-based therapies.
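To make the idea of goal-conditioned behavior cloning over a fused visual and kinematic embedding more concrete, the following is a minimal toy sketch and not the DINO-CVA implementation. Everything in it is an assumption made for illustration: the class `ToyGoalConditionedPolicy`, the helpers `rollout` and `bc_loss`, the additive fusion of linear projections, the use of a goal feature vector with the same size as the visual features, and all dimensions. The visual features are assumed to come from some pretrained image encoder that is not shown, and the weights are random rather than trained.

```python
import math
import random


def init_weights(rng, n_out, n_in):
    """Random dense-layer weights (a list of rows); purely illustrative."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def linear(weights, x):
    """Apply a bias-free dense layer to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ToyGoalConditionedPolicy:
    """Hypothetical policy: fuses visual, kinematic, and goal inputs into one embedding."""

    def __init__(self, visual_dim, action_dim, embed_dim, seed=0):
        rng = random.Random(seed)
        self.w_visual = init_weights(rng, embed_dim, visual_dim)
        self.w_kin = init_weights(rng, embed_dim, action_dim)
        self.w_goal = init_weights(rng, embed_dim, visual_dim)
        self.w_action = init_weights(rng, action_dim, embed_dim)

    def embed(self, visual, prev_action, goal):
        # Assumed fusion rule: sum the three projections, then apply tanh.
        v = linear(self.w_visual, visual)
        k = linear(self.w_kin, prev_action)
        g = linear(self.w_goal, goal)
        return [math.tanh(a + b + c) for a, b, c in zip(v, k, g)]

    def predict(self, visual, prev_action, goal):
        # Map the joint embedding to the next joystick action.
        return linear(self.w_action, self.embed(visual, prev_action, goal))


def rollout(policy, visual_frames, first_action, goal):
    """Autoregressive prediction: each predicted action is fed back as the
    kinematic input for the next step."""
    actions = []
    prev = first_action
    for frame in visual_frames:
        prev = policy.predict(frame, prev, goal)
        actions.append(prev)
    return actions


def bc_loss(predicted, expert):
    """Behavior cloning objective, assumed here to be the mean squared error
    between predicted and expert actions."""
    errors = [(p - e) ** 2 for ps, es in zip(predicted, expert) for p, e in zip(ps, es)]
    return sum(errors) / len(errors)
```

In this sketch, changing only the goal vector changes the predicted action sequence, which is the sense in which the policy is goal-conditioned. A trained version would fit the weights by minimizing `bc_loss` against expert demonstrations.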