Action recognition is a prerequisite for many applications in laparoscopic video analysis, including but not limited to surgical training, operating room planning, follow-up surgery preparation, post-operative surgical assessment, and surgical outcome estimation. However, automatic action recognition in laparoscopic surgery faces numerous challenges, such as (I) variation in duration both across and within actions, (II) distortion of relevant content due to smoke, blood accumulation, fast camera motion, organ movement, and object occlusion, and (III) surgical scene variation due to different illumination and viewpoints. In addition, action annotations for laparoscopic surgery are scarce and expensive because they require expert knowledge. In this study, we design and evaluate a CNN-RNN architecture together with a customized training-inference framework to address these challenges. Using stacked recurrent layers, the proposed network exploits inter-frame dependencies to mitigate the negative effect of content distortion and variation on action recognition. Furthermore, the proposed frame sampling strategy handles the duration variation of surgical actions, enabling action recognition with high temporal resolution. Extensive experiments confirm that the proposed method outperforms static CNNs in action recognition.
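The abstract does not spell out the frame sampling strategy, so the following is only a minimal sketch of one common way to draw a fixed number of frames from clips of very different durations before feeding them to a recurrent network. The function name `sample_frame_indices`, its parameters, and the segment-based scheme (random position per segment in training, segment centre at inference) are assumptions for illustration, not the method evaluated in this study.

```python
import random


def sample_frame_indices(num_frames, num_samples, train=False, rng=None):
    """Pick `num_samples` frame indices from a clip of `num_frames` frames.

    Hypothetical scheme: the clip is split into `num_samples` equal-length
    segments and one index is taken from each segment, at a random position
    during training and at the segment centre during inference.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    rng = rng or random
    segment_length = num_frames / num_samples  # may be fractional or < 1
    indices = []
    for k in range(num_samples):
        start = k * segment_length
        end = (k + 1) * segment_length
        position = rng.uniform(start, end) if train else (start + end) / 2
        # Clamp so that rounding never points past the last frame.
        indices.append(min(int(position), num_frames - 1))
    return indices


# A long action and a very short one both yield five input frames.
long_clip = sample_frame_indices(100, 5)
short_clip = sample_frame_indices(3, 5)
```

Under this assumed scheme, short actions repeat frames while long actions are subsampled, so every action reaches the recurrent layers as a sequence of the same length regardless of its duration.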