Surgical procedures are inherently complex and dynamic, with intricate dependencies and varied execution paths. Accurately identifying the intentions behind critical actions, referred to as Primary Intentions (PIs), is crucial for understanding and planning the procedure. This paper presents a novel framework that advances PI recognition in instructional videos by combining top-down grammatical structure with bottom-up visual cues. The grammatical structure is derived from a rich corpus of surgical procedures, offering a hierarchical perspective on surgical activities. A grammar parser, built on this surgical activity grammar, processes visual evidence extracted from laparoscopic images by surgical action detectors, yielding a more precise interpretation of the visual information. Experimental results on a benchmark dataset demonstrate that our method outperforms existing surgical activity detectors that rely solely on visual features. Our research provides a promising foundation for developing advanced robotic surgical systems with enhanced planning and automation capabilities.
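The top-down/bottom-up combination described above can be illustrated with a minimal sketch. This is not the paper's implementation: here the activity grammar is approximated by a set of allowed action transitions, and a Viterbi-style dynamic program selects the highest-scoring action sequence the grammar accepts, given per-frame detector probabilities. All action names, transitions, and scores are hypothetical.

```python
# Hedged sketch: grammar-constrained decoding over bottom-up detector scores.
# The "grammar" is simplified to allowed transitions between actions; a
# Viterbi-style dynamic program finds the best grammar-consistent sequence.
# Action labels and the transition table are illustrative assumptions.

ACTIONS = ["grasp", "cut", "clip", "retract"]

# Hypothetical grammar: which action may follow which (self-loops always allowed).
ALLOWED_NEXT = {
    "grasp": {"cut", "clip", "retract"},
    "retract": {"grasp", "cut"},
    "clip": {"cut"},
    "cut": {"grasp", "retract"},
}

def parse(frame_scores):
    """frame_scores: list of dicts mapping action -> detector probability.
    Returns the highest-probability action sequence consistent with the grammar."""
    # best[a] = (score of best grammar-consistent path ending in action a, path)
    best = {a: (frame_scores[0][a], [a]) for a in ACTIONS}
    for scores in frame_scores[1:]:
        new_best = {}
        for a in ACTIONS:
            # Either stay in the same action or make a grammar-allowed switch.
            candidates = [
                (s * scores[a], path + [a])
                for prev, (s, path) in best.items()
                if prev == a or a in ALLOWED_NEXT[prev]
            ]
            if candidates:
                new_best[a] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]

# Toy input: the raw per-frame argmax at the last step would be "clip",
# but the grammar ("cut" cannot be followed by "clip" here) steers the
# parse toward a sequence that the grammar accepts.
frames = [
    {"grasp": 0.7, "cut": 0.1, "clip": 0.1, "retract": 0.1},
    {"grasp": 0.1, "cut": 0.6, "clip": 0.2, "retract": 0.1},
    {"grasp": 0.2, "cut": 0.1, "clip": 0.4, "retract": 0.3},
]
print(parse(frames))  # -> ['grasp', 'cut', 'retract']
```

The design point this sketch captures is the one the abstract makes: purely visual, frame-wise predictions can be locally plausible but structurally invalid, and a top-down grammar prunes those sequences during decoding.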