There is growing interest in using deep learning models to process long surgical videos, both to automatically detect clinical and operational activities and to extract metrics that can enable workflow-efficiency tools and applications. However, training such models requires vast amounts of labeled data, which is costly and does not scale. Recently, the computer vision community has explored self-supervised learning to reduce the annotation burden. Masked autoencoders (MAE) have attracted attention as a self-supervised paradigm for Vision Transformers (ViTs): they predict randomly masked regions from the visible patches of an image or video clip, and have shown superior performance on benchmark datasets. However, the application of MAE to surgical data remains unexplored. In this paper, we first investigate whether MAE can learn transferable representations in the surgical video domain. We then propose SurgMAE, a novel architecture with a masking strategy for MAE based on sampling tokens with high spatio-temporal information. We provide an empirical study of SurgMAE on two large-scale long surgical video datasets and find that our method outperforms several baselines in the low-data regime. We conduct extensive ablation studies to show the efficacy of our approach, and also demonstrate its superior performance on UCF-101 to show that it generalizes to non-surgical datasets as well.
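To make the random masking step concrete, here is a minimal sketch of how a standard MAE might split the tokens of a video clip into visible and masked sets. It illustrates plain uniform random masking only, not SurgMAE's sampling strategy, whose details are not given here. The function name `random_mask`, the 8 x 14 x 14 token grid, and the 0.9 mask ratio are all assumptions chosen for illustration.

```python
import random


def random_mask(num_tokens, mask_ratio, seed=None):
    """Split token indices into (visible, masked) uniformly at random.

    Hypothetical helper for illustration: an encoder would see only the
    visible tokens, and a decoder would reconstruct the masked ones.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)  # uniform random permutation of token indices
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Assumed token grid: 8 temporal slots x 14 x 14 spatial patches.
num_tokens = 8 * 14 * 14
visible, masked = random_mask(num_tokens, mask_ratio=0.9, seed=0)
print(len(visible), len(masked))
```

A masking strategy driven by token content would replace the shuffle with sampling weighted by some per-token score. That score is left undefined here because this text does not specify it.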