Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches rely predominantly on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enrich surgical scene representations with 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies temporal segments containing tool-tissue interactions, which are then spatially localized in the audio-visual scene representation. The system was evaluated experimentally in a realistic operating-room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. The experimental evaluation demonstrates accurate spatial sound localization and robust fusion of the multimodal data, yielding a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant step toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and lays a foundation for future intelligent and autonomous surgical systems.
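The abstract does not include implementation details, so the following is a minimal sketch, in Python/NumPy, of one plausible way to project an acoustic localization estimate onto an RGB-D point cloud: a direction-of-arrival (DOA) unit vector in the microphone-array frame is mapped into the camera frame through a known extrinsic calibration, and the sound event is attached to the nearest point-cloud point inside a small angular cone around the acoustic ray. The function name, the calibration variables (`R_cam_mic`, `t_cam_mic`), and the 5° tolerance are all illustrative assumptions, not the authors' actual method.

```python
import numpy as np

def project_doa_to_point_cloud(doa_dir, points, R_cam_mic, t_cam_mic,
                               max_angle_deg=5.0):
    """Associate an acoustic direction-of-arrival (DOA) estimate with the
    point-cloud point that best explains the sound source (assumed scheme).

    doa_dir   : (3,) unit vector, DOA in the microphone-array frame
    points    : (N, 3) point cloud in the camera frame (metres)
    R_cam_mic : (3, 3) rotation, mic-array frame -> camera frame
    t_cam_mic : (3,) position of the array origin in the camera frame
    """
    # Express the acoustic ray in the camera frame.
    ray_dir = R_cam_mic @ doa_dir
    ray_dir = ray_dir / np.linalg.norm(ray_dir)

    # Vector from the array origin to every point, and its angle to the ray.
    rel = points - t_cam_mic                        # (N, 3)
    dists = np.linalg.norm(rel, axis=1)             # range to each point
    cos_ang = (rel @ ray_dir) / np.maximum(dists, 1e-9)
    angles = np.degrees(np.arccos(np.clip(cos_ang, -1.0, 1.0)))

    # Keep points inside the angular tolerance around the DOA ray,
    # then pick the closest one, i.e. the first visible surface hit.
    mask = angles < max_angle_deg
    if not mask.any():
        return None                                 # no surface on the ray
    candidates = np.flatnonzero(mask)
    return candidates[np.argmin(dists[mask])]       # index into `points`
```

Choosing the closest candidate along the ray approximates the first visible surface the acoustic ray intersects, which is a reasonable heuristic when the sound source (a tool-tissue contact) lies on a surface the RGB-D camera can see.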
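The transformer-based acoustic event detection module is likewise only named in the abstract. Below is a minimal PyTorch sketch of a frame-wise sound-event detector over log-mel spectrogram features, which is one common way to realize such a module; the architecture, feature choice, and all hyperparameters are assumptions for illustration, not the authors' model.

```python
import torch
import torch.nn as nn

class AcousticEventDetector(nn.Module):
    """Minimal transformer encoder for frame-wise acoustic event detection:
    log-mel spectrogram frames in, per-frame event probabilities out."""

    def __init__(self, n_mels=64, d_model=128, n_heads=4, n_layers=4,
                 n_events=1, max_frames=1000):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)       # embed each frame
        # Learned positional embedding so the encoder sees frame order.
        self.pos = nn.Parameter(torch.zeros(1, max_frames, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_events)     # frame-wise logits

    def forward(self, mel):                          # (batch, frames, n_mels)
        h = self.proj(mel) + self.pos[:, :mel.size(1)]
        h = self.encoder(h)                          # temporal self-attention
        return torch.sigmoid(self.head(h))           # (batch, frames, n_events)

# Toy usage: 2 clips, 300 frames of 64-band log-mel features each.
model = AcousticEventDetector()
probs = model(torch.randn(2, 300, 64))               # -> (2, 300, 1)
event_frames = probs.squeeze(-1) > 0.5               # frames flagged as events
```

Thresholded frame probabilities yield the temporal segments that the abstract describes; each detected segment could then be passed to the DOA projection step above for spatial localization.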