Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation

In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.
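
To make the fixation density overlap analysis concrete, the following minimal sketch builds normalized fixation density maps from gaze points and compares them with histogram intersection. The abstract does not state which overlap metric, smoothing bandwidth, or image resolution was used, so the metric choice, the function names `fixation_density` and `density_overlap`, the grid size, `sigma`, and all fixation coordinates are illustrative assumptions rather than the authors' method.

```python
import math

def fixation_density(fixations, width, height, sigma=2.0):
    """Sum an isotropic Gaussian at each (x, y) fixation, then normalize to sum 1.

    sigma is a hypothetical smoothing bandwidth in pixels.
    """
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in grid)
    if total == 0.0:
        raise ValueError("at least one fixation is required")
    return [[v / total for v in row] for row in grid]

def density_overlap(p, q):
    """Histogram intersection of two normalized maps of equal shape.

    Returns a value in [0, 1]: 1 for identical maps, near 0 for disjoint ones.
    """
    return sum(min(a, b) for row_p, row_q in zip(p, q) for a, b in zip(row_p, row_q))

# Hypothetical fixations on one 32x24 frame (not data from the paper).
active = fixation_density([(8, 8), (9, 7)], 32, 24)
passive_near = fixation_density([(10, 9), (9, 9)], 32, 24)
passive_far = fixation_density([(26, 18), (27, 19)], 32, 24)

print(density_overlap(active, passive_near))  # nearby gaze: higher overlap
print(density_overlap(active, passive_far))   # distant gaze: lower overlap
```

Because active and passive gaze are recorded on the same videos, a per-frame score of this kind can be averaged over frames to compare groups. Other saliency metrics, such as correlation coefficient or KL divergence, could be substituted in `density_overlap` without changing the rest of the sketch.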
