Traditional convolutional neural networks have limited receptive fields, while transformer-based networks struggle to build long-range dependencies efficiently due to their quadratic computational complexity. This bottleneck poses a significant challenge when processing long sequences in video analysis tasks. Recently, state space models (SSMs) with efficient hardware-aware designs, exemplified by Mamba, have achieved impressive results in long-sequence modeling, facilitating the development of deep neural networks for many vision tasks. To better capture the dynamic cues available across video frames, this paper presents a generic Video Vision Mamba-based framework, dubbed \textbf{Vivim}, for medical video object segmentation. Vivim effectively compresses long-term spatiotemporal representations into sequences at multiple scales via our designed Temporal Mamba Block. We also introduce a boundary-aware constraint to enhance Vivim's discriminative ability on ambiguous lesions in medical images. Extensive experiments on thyroid segmentation in ultrasound videos and polyp segmentation in colonoscopy videos demonstrate that Vivim is more effective and efficient than existing methods. The code is available at: https://github.com/scott-yjyang/Vivim.
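To make the core idea concrete, below is a minimal sketch, not the authors' implementation, of how a temporal Mamba block can flatten multi-frame features into one long spatiotemporal sequence and model it with a selective state space layer. The class name `TemporalMambaBlock`, the tensor layout, and all hyperparameters are illustrative assumptions; the actual Vivim block in the linked repository differs in detail.

```python
# Minimal sketch (assumption, not the authors' code) of a temporal Mamba block:
# flatten per-frame spatial features into one long sequence and apply a
# selective SSM (Mamba) for long-range spatiotemporal mixing.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm


class TemporalMambaBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Selective state space layer; hyperparameters here are placeholders.
        self.mamba = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) multi-frame features at one scale.
        b, t, h, w, c = x.shape
        seq = x.reshape(b, t * h * w, c)        # compress space and time into one sequence
        seq = seq + self.mamba(self.norm(seq))  # long-range mixing with a residual connection
        return seq.reshape(b, t, h, w, c)


if __name__ == "__main__":
    device = "cuda"  # the mamba-ssm selective-scan kernel requires a CUDA device
    feats = torch.randn(1, 4, 16, 16, 64, device=device)  # 4 frames, 16x16 maps, 64 channels
    block = TemporalMambaBlock(64).to(device)
    print(block(feats).shape)  # torch.Size([1, 4, 16, 16, 64])
```

Because the SSM scans the sequence with complexity linear in its length, processing all frames jointly stays tractable even when the flattened sequence is far longer than what quadratic attention could handle.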