Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos

Despite significant advancements in robotic systems and surgical data science, ensuring safe and optimal execution in robot-assisted minimally invasive surgery (RMIS) remains a complex challenge. Current surgical error detection methods involve two parts: identifying surgical gestures and then detecting errors within each gesture clip. These methods seldom consider the rich contextual and semantic information inherent in surgical videos, and their performance is limited by their reliance on accurate gesture identification. Motivated by chain-of-thought prompting in natural language processing, this letter presents a novel, real-time, end-to-end error detection framework, Chain-of-Gesture (COG) prompting, which leverages contextual information from surgical videos. The framework encompasses two reasoning modules designed to mimic the decision-making processes of expert surgeons. Concretely, the first, a Gestural-Visual Reasoning module, uses transformer and attention architectures for gesture prompting, while the second, a Multi-Scale Temporal Reasoning module, employs a multi-stage temporal convolutional network with both slow and fast paths for temporal information extraction. We extensively validate our method on the public benchmark RMIS dataset JIGSAWS. Our method encapsulates the reasoning processes inherent to surgical activities, enabling it to outperform the state-of-the-art by 4.6% in F1 score, 4.6% in Accuracy, and 5.9% in Jaccard index while processing each frame in 6.69 milliseconds on average. These results demonstrate the great potential of our approach for enhancing the safety and efficacy of RMIS procedures and surgical education. The code will be available.
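
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how F1 score, Accuracy, and Jaccard index could be computed frame by frame for binary error detection, and how a per-frame latency converts to throughput. It is not the authors' evaluation code: the function names, the toy label sequences, and the convention that label 1 marks an erroneous frame are all assumptions made for illustration, and the actual evaluation protocol may differ (for example, in how scores are averaged).

```python
def framewise_metrics(y_true, y_pred):
    """Binary frame-wise metrics. Assumption: label 1 = erroneous frame, 0 = normal."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")

    # Count the four confusion-matrix cells over all frames.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # Jaccard index is intersection over union of the positive frames.
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"accuracy": accuracy, "f1": f1, "jaccard": jaccard}


def frames_per_second(ms_per_frame):
    """Convert an average per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from JIGSAWS.
    y_true = [0, 0, 1, 1, 1, 0, 0, 1]
    y_pred = [0, 1, 1, 1, 0, 0, 0, 1]
    print(framewise_metrics(y_true, y_pred))
    # 6.69 ms per frame is the latency reported in the abstract.
    print(round(frames_per_second(6.69), 1))
```

On the toy sequences above, the sketch gives an accuracy of 0.75, an F1 score of 0.75, and a Jaccard index of 0.6. The reported 6.69 milliseconds per frame corresponds to roughly 149.5 frames per second.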
