Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches rely predominantly on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene representations by integrating 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies relevant temporal segments containing tool-tissue interactions, which are then spatially localized in the audio-visual scene representation. The system was evaluated experimentally in a realistic operating-room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Experimental evaluation demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant advancement toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and provides a foundation for future intelligent and autonomous surgical systems.
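The core projection step described in the Methods can be sketched as follows: given a direction-of-arrival (DOA) estimate from the phased microphone array, tag the point-cloud points that lie within a narrow angular cone around that direction. This is a minimal illustrative sketch, not the paper's implementation; the function name, the cone-based association, and all parameters are assumptions introduced here for clarity.

```python
import numpy as np

def project_acoustic_event(points, mic_origin, doa, cone_deg=5.0):
    """Illustrative sketch (not the paper's API): mark point-cloud points
    lying within a narrow cone around the acoustic DOA estimate.

    points     : (N, 3) array of point-cloud coordinates
    mic_origin : (3,) position of the microphone array
    doa        : (3,) estimated direction-of-arrival vector
    cone_deg   : assumed angular tolerance of the localization cone
    """
    doa = doa / np.linalg.norm(doa)          # unit DOA vector
    vecs = points - mic_origin               # rays from array to each point
    dists = np.linalg.norm(vecs, axis=1)
    # cosine of the angle between each point's ray and the DOA
    cos_ang = (vecs @ doa) / np.maximum(dists, 1e-9)
    return cos_ang >= np.cos(np.deg2rad(cone_deg))

# usage: a point straight ahead along the DOA is tagged, one off-axis is not
pts = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
mask = project_acoustic_event(pts, np.zeros(3), np.array([0.0, 0.0, 1.0]))
# mask → [True, False]
```

In a full system this per-frame mask would be re-evaluated on each incoming RGB-D point cloud, yielding the 4D (space plus time) association between detected acoustic events and scene geometry that the abstract describes.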