Traditional convolutional neural networks have a limited receptive field, while Transformer-based networks suffer from quadratic computational complexity when modeling long-range dependencies. This bottleneck poses a significant challenge when processing long video sequences in video analysis tasks. Very recently, state space models (SSMs) with efficient hardware-aware designs, exemplified by Mamba, have achieved impressive results in long-sequence modeling, facilitating the development of deep neural networks for many vision tasks. To better capture the cues available across video frames, this paper presents Vivim, a generic Video Vision Mamba-based framework for medical video object segmentation. With our designed Temporal Mamba Block, Vivim effectively compresses long-term spatiotemporal representations into sequences at multiple scales. Compared to existing video-level Transformer-based methods, our model delivers excellent segmentation results at higher speed. Extensive experiments on a breast ultrasound (US) dataset demonstrate the effectiveness and efficiency of Vivim. The code for Vivim is available at: https://github.com/scott-yjyang/Vivim.
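To make the core idea concrete, below is a minimal sketch of what a Temporal Mamba Block could look like: video features at one scale are flattened into a single long token sequence spanning all frames, mixed by a selective SSM layer, and reshaped back. This is an illustration under stated assumptions, not the authors' implementation; the class name `TemporalMambaBlock`, the scan order, and the residual placement are assumptions, while the `Mamba` layer itself comes from the `mamba_ssm` package (requires a CUDA build).

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (CUDA required)


class TemporalMambaBlock(nn.Module):
    """Sketch: mix spatiotemporal cues with a selective SSM.

    Hypothetical layout, not the paper's exact block: flatten
    (B, T, C, H, W) features into T*H*W tokens so one linear-time
    scan covers every frame, then restore the video layout.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Selective SSM with linear complexity in sequence length.
        self.mamba = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) video features at one decoder scale.
        b, t, c, h, w = x.shape
        # Flatten frames and spatial positions into one long sequence.
        seq = x.permute(0, 1, 3, 4, 2).reshape(b, t * h * w, c)
        seq = seq + self.mamba(self.norm(seq))  # residual SSM mixing
        return seq.reshape(b, t, h, w, c).permute(0, 1, 4, 2, 3)


if __name__ == "__main__":
    block = TemporalMambaBlock(dim=64).cuda()
    clip = torch.randn(1, 8, 64, 16, 16, device="cuda")  # 8-frame clip
    print(block(clip).shape)  # torch.Size([1, 8, 64, 16, 16])
```

The linear-time scan is what lets such a block handle sequences of T*H*W tokens, where a video-level Transformer's self-attention would scale quadratically in that length.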