Disentangling spatio-temporal knowledge for weakly supervised object detection and segmentation in surgical video

Weakly supervised video object segmentation (WSVOS) enables the identification of segmentation maps without requiring an extensive training dataset of object masks, relying instead on coarse video labels indicating object presence. Current state-of-the-art methods either require multiple independent processing stages that employ motion cues or, in the case of end-to-end trainable networks, fall short in segmentation accuracy, in part because segmentation maps are difficult to learn from videos with transient object presence. This limits the application of WSVOS to semantic annotation of surgical videos, where multiple surgical tools frequently move in and out of the field of view, a harder problem than WSVOS typically encounters. This paper introduces Video Spatio-Temporal Disentanglement Networks (VDST-Net), a framework that disentangles spatio-temporal information using semi-decoupled knowledge distillation to predict high-quality class activation maps (CAMs). A teacher network, designed to resolve temporal conflicts when no specifics about object location and timing in the video are provided, works with a student network that integrates information over time by leveraging temporal dependencies. We demonstrate the efficacy of our framework on a public reference dataset and on a more challenging surgical video dataset where objects are, on average, present in less than 60% of annotated frames. Our method outperforms state-of-the-art techniques and generates superior segmentation masks under video-level weak supervision.
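
The two ingredients named in the abstract, class activation maps and teacher-student distillation under video-level labels, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the tensor shapes, the max-over-time aggregation, and the mean-squared-error distillation term are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np

def class_activation_maps(features, weights):
    """Standard CAM: weight each feature channel by the classifier weight.

    features: (T, K, H, W) per-frame feature maps (T frames, K channels).
    weights:  (C, K) weights of a linear classifier over pooled features.
    Returns CAMs of shape (T, C, H, W).
    """
    return np.einsum("ck,tkhw->tchw", weights, features)

def video_level_logits(cams):
    """Aggregate CAMs into one score per class for the whole video.

    Assumption: spatial average pooling per frame, then a max over time,
    so a class counts as present if it appears in at least one frame.
    """
    frame_logits = cams.mean(axis=(2, 3))  # (T, C)
    return frame_logits.max(axis=0)        # (C,)

def distillation_loss(student_cams, teacher_cams):
    """Hypothetical distillation term: mean squared error between the
    student's and the teacher's CAMs."""
    return float(np.mean((student_cams - teacher_cams) ** 2))

# Toy usage with random tensors (illustrative shapes only).
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16, 7, 7))  # 8 frames, 16 channels, 7x7 maps
weights = rng.normal(size=(3, 16))         # 3 object classes
teacher_cams = class_activation_maps(features, weights)
student_cams = teacher_cams + 0.1 * rng.normal(size=teacher_cams.shape)
video_scores = video_level_logits(student_cams)
loss = distillation_loss(student_cams, teacher_cams)
```

With a linear classifier on spatially averaged features, the spatial mean of each CAM equals that frame's class logit (ignoring any bias term), which is why the maps can be trained from presence labels alone. The max over time is only one way to turn per-frame scores into a video-level prediction.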

