Automatic activity detection is an important component for developing technologies that enable next-generation surgical devices and workflow monitoring systems. In many applications, the videos of interest are long and include several activities; hence, deep models designed for such purposes consist of a backbone and a temporal sequence modeling architecture. In this paper, we investigate both state-of-the-art activity recognition models and temporal models to find the architectures that yield the highest performance. We first benchmark these models on a large-scale activity recognition dataset recorded in the operating room, with over 800 full-length surgical videos. However, since most other medical applications lack such a large dataset, we further evaluate our models on the Cholec80 surgical phase segmentation dataset, consisting of only 40 training videos. For backbone architectures, we investigate both 3D ConvNets and the most recent transformer-based models; for temporal modeling, we include temporal ConvNets, RNNs, and transformer models for a comprehensive and thorough study. We show that even with limited labeled data, we can outperform existing work by benefiting from models pre-trained on other tasks.
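The two-stage design described above, a frame-level backbone followed by a temporal sequence model, can be sketched as follows. This is a minimal illustrative NumPy sketch: the "backbone" and "temporal model" here are toy stand-ins (a fixed random projection and a causal moving average) chosen only to show the data flow, not any of the architectures evaluated in the paper.

```python
import numpy as np

def extract_frame_features(frames, feat_dim=8):
    # Stand-in for a pretrained backbone (e.g. a 3D ConvNet or video
    # transformer): project each flattened frame with a fixed random
    # matrix. Shapes only; no real visual features are computed.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((frames.shape[1], feat_dim))
    return frames @ W  # (T, feat_dim)

def temporal_model(features, window=5):
    # Stand-in for the temporal stage (temporal ConvNet / RNN /
    # transformer): a causal moving average over the feature sequence,
    # so each output frame aggregates context from preceding frames.
    T, _ = features.shape
    out = np.zeros_like(features)
    for t in range(T):
        lo = max(0, t - window + 1)
        out[t] = features[lo:t + 1].mean(axis=0)
    return out

# Toy "video": 30 frames, each a 16-dim flattened frame vector.
video = np.random.default_rng(1).standard_normal((30, 16))
feats = extract_frame_features(video)   # per-frame embeddings, (30, 8)
context = temporal_model(feats)         # temporally aggregated, (30, 8)
print(feats.shape, context.shape)
```

A real pipeline would replace the projection with a pretrained video backbone and the moving average with a learned temporal architecture, but the interface is the same: per-frame embeddings in, temporally contextualized embeddings out.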