Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition

Vision algorithms capable of interpreting scenes from a real-time video stream are necessary for computer-assisted surgery systems to achieve context-aware behavior. In laparoscopic procedures, one algorithm such systems need is the identification of surgical phases, for which the current state of the art is a model based on a CNN-LSTM. A number of previous works using models of this kind have trained them in a fully supervised manner, which requires a fully annotated dataset. Our work instead addresses the problem of learning surgical phase recognition when annotated data is scarce (under 25% of all available video recordings). We propose a teacher/student approach in which a strong predictor called the teacher, trained beforehand on a small dataset of ground-truth-annotated videos, generates synthetic annotations for a larger dataset, from which another model, the student, then learns. In our case, the teacher features a novel CNN-biLSTM-CRF architecture, designed for offline inference only. The student, on the other hand, is a CNN-LSTM capable of making real-time predictions. Results for various amounts of manually annotated videos demonstrate the superiority of the new CNN-biLSTM-CRF predictor, as well as improved performance from the CNN-LSTM when it is trained using synthetic labels generated for unannotated videos. For both offline and online surgical phase recognition with very few annotated recordings available, this new teacher/student strategy provides a valuable performance improvement by efficiently leveraging the unannotated data.
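The data flow of the teacher/student strategy can be illustrated with a minimal toy sketch. It is not the architecture described above: a nearest-centroid frame classifier stands in for the CNN features and recurrent layers, a centered majority vote (which looks at future frames, so it is offline only) stands in for the biLSTM-CRF teacher, and a per-frame causal predictor stands in for the real-time CNN-LSTM student. All function names and the scalar per-frame "features" are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Video = List[float]                    # toy stand-in: one scalar feature per frame
Labels = List[int]                     # one surgical phase index per frame
Dataset = List[Tuple[Video, Labels]]


def fit_centroids(data: Dataset) -> Dict[int, float]:
    """Fit a nearest-centroid frame classifier: mean feature value per phase."""
    sums: Dict[int, float] = {}
    counts: Counter = Counter()
    for video, labels in data:
        for x, y in zip(video, labels):
            sums[y] = sums.get(y, 0.0) + x
            counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_frame(centroids: Dict[int, float], x: float) -> int:
    """Return the phase whose centroid is closest to the frame feature."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def teacher_predict(centroids: Dict[int, float], video: Video,
                    half_window: int = 1) -> Labels:
    """Offline prediction: per-frame labels smoothed with a centered
    majority vote, which uses past AND future frames."""
    raw = [predict_frame(centroids, x) for x in video]
    smoothed = []
    for t in range(len(raw)):
        window = raw[max(0, t - half_window): t + half_window + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed


def student_predict_online(centroids: Dict[int, float], video: Video) -> Labels:
    """Online prediction: each frame is labelled without seeing later frames."""
    return [predict_frame(centroids, x) for x in video]


def train_student(annotated: Dataset, unannotated: List[Video],
                  half_window: int = 1) -> Dict[int, float]:
    """Teacher/student pipeline: fit the teacher on the small annotated set,
    generate synthetic labels for unannotated videos, fit the student on both."""
    teacher = fit_centroids(annotated)
    synthetic = [(v, teacher_predict(teacher, v, half_window)) for v in unannotated]
    return fit_centroids(list(annotated) + synthetic)


# Usage: one manually annotated video, one unannotated video.
annotated = [([0.0, 0.2, 0.1, 1.0, 0.9, 1.1], [0, 0, 0, 1, 1, 1])]
unannotated = [[0.1, 0.0, 0.8, 0.2, 0.1, 0.9, 1.0, 1.2]]
student = train_student(annotated, unannotated)
print(student_predict_online(student, [0.0, 1.0]))  # [0, 1]
```

In this toy run the teacher's smoothing overrides the isolated outlier frame (0.8) in the unannotated video before the student sees it, which loosely mirrors the idea that an offline model with access to the whole video can produce cleaner synthetic labels than a causal model could.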
