This paper presents a deep learning framework for medical video segmentation. Convolutional neural network (CNN) and transformer-based methods have reached major milestones in medical image segmentation, owing to their strong semantic feature encoding and global information comprehension. However, most existing approaches ignore a salient aspect of medical video data: the temporal dimension. Our proposed framework explicitly extracts features from neighbouring frames across the temporal dimension and incorporates them with a temporal feature blender, which then tokenises the high-level spatio-temporal feature to form a strong global feature encoded via a Swin Transformer. The final segmentation results are produced via a UNet-like encoder-decoder architecture. Our model outperforms other approaches by a significant margin and improves the segmentation benchmarks on the VFSS2022 dataset, achieving Dice coefficients of 0.8986 and 0.8186 on the two datasets tested. Our studies also show the efficacy of the temporal feature blending scheme and the cross-dataset transferability of learned capabilities. Code and models are fully available at https://github.com/SimonZeng7108/Video-SwinUNet.
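For readers unfamiliar with the reported metric, the Dice coefficient measures the overlap between a predicted mask and a ground-truth mask as twice the size of their intersection divided by the sum of their sizes. The following is a minimal sketch, assuming binary masks given as nested lists of 0/1 values; the function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are illustrative assumptions and are not taken from the authors' repository.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary 2D masks.

    Illustrative helper (hypothetical name, not the authors' code).
    `pred` and `target` are nested lists of 0/1 values with equal size.
    """
    # Flatten both masks and coerce every pixel to 0 or 1.
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_target = [int(bool(v)) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    pred = [[1, 1, 0],
            [0, 1, 0]]
    target = [[1, 0, 0],
              [0, 1, 1]]
    # Two overlapping pixels, three foreground pixels in each mask.
    print(round(dice_coefficient(pred, target), 4))
```

A score of 1.0 indicates identical masks and 0.0 indicates no overlap, so the values of 0.8986 and 0.8186 reported above correspond to a high degree of agreement with the ground truth.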