Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation

In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.
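
To make the fixation density overlap analysis concrete, the following is a minimal sketch, not the authors' pipeline. It assumes fixations are given as (x, y) pixel coordinates, builds a density map by summing an isotropic Gaussian at each fixation, and compares two maps with histogram intersection and Pearson correlation, two measures commonly used in saliency evaluation. The function names, the bandwidth `sigma`, the frame size, and the toy coordinates are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fixation_density(fixations, height, width, sigma=10.0):
    """Sum an isotropic Gaussian at each (x, y) fixation; normalize to sum to 1."""
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width), dtype=np.float64)
    for x, y in fixations:
        density += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    total = density.sum()
    if total == 0:
        raise ValueError("need at least one fixation inside or near the frame")
    return density / total


def similarity(p, q):
    """Histogram intersection of two normalized maps (1.0 means identical)."""
    return float(np.minimum(p, q).sum())


def correlation(p, q):
    """Pearson correlation between two density maps, computed over all pixels."""
    return float(np.corrcoef(p.ravel(), q.ravel())[0, 1])


# Toy data (assumed): one operator's gaze and two hypothetical observers
# on the same video frame.
height, width = 64, 96
active = [(40, 30), (42, 33), (45, 31)]
passive_near = [(41, 32), (44, 30), (39, 34)]  # looks where the operator looks
passive_far = [(10, 50), (12, 52), (8, 49)]    # looks elsewhere

d_active = fixation_density(active, height, width, sigma=4.0)
d_near = fixation_density(passive_near, height, width, sigma=4.0)
d_far = fixation_density(passive_far, height, width, sigma=4.0)

print("near observer: SIM=%.3f CC=%.3f"
      % (similarity(d_active, d_near), correlation(d_active, d_near)))
print("far observer:  SIM=%.3f CC=%.3f"
      % (similarity(d_active, d_far), correlation(d_active, d_far)))
```

Because both maps come from the same frame, a higher overlap score here indicates that the passive observer attended to the same regions as the operator, which is the kind of same-video comparison the paired design enables.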
