Real-time recognition and prediction of surgical activities are fundamental to advancing safety and autonomy in robot-assisted surgery. This paper presents a multimodal transformer architecture for real-time recognition and prediction of surgical gestures and trajectories from short segments of kinematic and video data. We conduct an ablation study to evaluate how fusing different input modalities, and different representations of them, affects gesture recognition and prediction performance. We perform an end-to-end assessment of the proposed architecture on the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) dataset. Our model outperforms the state of the art (SOTA), reaching 89.5% accuracy for gesture prediction through effective fusion of kinematic features with spatial and contextual video features. Because the model is computationally efficient, it runs in real time, processing a 1-second input window in 1.1-1.3 ms.
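The abstract describes operating on short, time-aligned segments of kinematic and video data. As a rough illustration of that input preparation only, the sketch below cuts two streams into 1-second windows and joins their per-frame features. It is not the paper's architecture: the 30 Hz rate, the feature sizes, the half-second stride, the helper names `sliding_windows` and `fuse_window`, and the use of plain concatenation in place of the transformer-based fusion are all assumptions made for this example.

```python
from typing import List, Sequence


def sliding_windows(frames: Sequence, window: int, stride: int) -> List[list]:
    """Return every full run of `window` consecutive frames, stepping by `stride`."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(frames) - window
    return [list(frames[s:s + window]) for s in range(0, last_start + 1, stride)]


def fuse_window(kin: Sequence[Sequence[float]],
                vid: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate per-frame kinematic and video feature vectors.

    Plain concatenation is a stand-in for the learned fusion in the paper.
    """
    if len(kin) != len(vid):
        raise ValueError("modalities must be time-aligned")
    return [list(k) + list(v) for k, v in zip(kin, vid)]


# Assumed setup: two seconds of data sampled at 30 Hz (hypothetical values).
RATE_HZ = 30
kinematics = [[float(t), 0.0, 0.0] for t in range(2 * RATE_HZ)]  # 3 placeholder values per frame
video_feats = [[0.5, 0.5] for _ in range(2 * RATE_HZ)]           # 2 placeholder values per frame

# 1-second windows with a half-second stride (the stride is an assumption).
kin_windows = sliding_windows(kinematics, window=RATE_HZ, stride=RATE_HZ // 2)
vid_windows = sliding_windows(video_feats, window=RATE_HZ, stride=RATE_HZ // 2)

# One fused sequence per window; each would be a single model input.
fused = [fuse_window(k, v) for k, v in zip(kin_windows, vid_windows)]
```

Each entry of `fused` is one window of per-frame feature vectors, which is the shape of input a sequence model such as a transformer encoder would consume, one window per inference step.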