Accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, there is a need for high robustness, i.e., no failures during tracking. To achieve this, one must efficiently tackle challenges such as: device obscuration by contrast agent or other external devices or wires, changes in field-of-view or acquisition angle, and continuous movement due to cardiac and respiratory motion. To overcome these challenges, we propose a novel approach to learn spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame-interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance and, in particular, robustness compared to highly optimized reference solutions (that use multi-stage feature fusion, multi-task learning, and flow regularization). Experiments show that our method achieves a 66.31% reduction in maximum tracking error compared to reference solutions (23.20% when flow regularization is used), attaining a success score of 97.95% at a 3x faster inference speed of 42 frames per second (on GPU). These results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.
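To make the pretraining objective concrete, the sketch below illustrates the general idea of masked image modeling with a frame-interpolation-style reconstruction target: patches of a middle frame are hidden, a predictor fills them in from the temporally adjacent frames, and the reconstruction loss is computed only over the masked regions. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation; the frame sizes, patch size, masking pattern, and the simple frame-averaging predictor (standing in for the learned network) are all placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frame triplet (H x W); in the paper these would be interventional
# X-ray frames. Sizes here are hypothetical, chosen for illustration.
H, W, P = 8, 8, 4
prev_f = rng.random((H, W))
next_f = rng.random((H, W))
# Stand-in for the true middle frame; here it is exactly the temporal
# average so the toy predictor can reconstruct it perfectly.
mid_f = 0.5 * (prev_f + next_f)

# Masked image modeling for sequences: hide patches of the middle frame.
mask = np.zeros((H, W), dtype=bool)
mask[:P, :P] = True                  # mask a single PxP patch

masked_input = np.where(mask, 0.0, mid_f)

# Trivial "interpolation" predictor as a stand-in for the learned model:
# fill masked locations from the temporally adjacent frames.
pred = np.where(mask, 0.5 * (prev_f + next_f), masked_input)

# Reconstruction loss computed only on masked patches, as is standard
# in masked image modeling.
loss = np.mean((pred[mask] - mid_f[mask]) ** 2)
```

In the actual approach, the averaging predictor would be replaced by a deep network trained over the 16M-frame cohort, so that filling in masked content forces it to learn fine inter-frame temporal correspondences.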