The application of 3D Vision Transformers (ViTs) to medical image segmentation has advanced rapidly, somewhat overshadowing emerging progress in Convolutional Neural Network (CNN)-based models. Large kernel depth-wise convolution has emerged as a promising technique, showing capabilities akin to hierarchical transformers and providing the large effective receptive field (ERF) that is vital for dense prediction. Nevertheless, existing core operators, ranging from global-local attention to large kernel convolution, exhibit inherent trade-offs and limitations (e.g., the global-local range trade-off and the aggregation of attentional features). We hypothesize that deformable convolution is an alternative worth exploring that can combine the advantages of these operators, providing long-range dependency, adaptive spatial aggregation, and computational efficiency as a foundation backbone. In this work, we introduce 3D DeformUX-Net, a volumetric CNN model that addresses the shortcomings traditionally associated with ViTs and large kernel convolution. Specifically, we revisit volumetric deformable convolution in a depth-wise setting to model long-range dependency with computational efficiency. Inspired by structural re-parameterization of convolution kernel weights, we further generate the deformable tri-planar offsets with a parallel branch (starting from a $1\times1\times1$ convolution), providing adaptive spatial aggregation across all channels. Our empirical evaluations show that 3D DeformUX-Net consistently outperforms existing state-of-the-art ViTs and large kernel convolution models across four challenging public datasets, spanning scales from organs (KiTS: 0.680 to 0.720, MSD Pancreas: 0.676 to 0.717, AMOS: 0.871 to 0.902) to vessels (e.g., MSD hepatic vessels: 0.635 to 0.671) in mean Dice.
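To make the core operator concrete, the following is a minimal toy sketch of depth-wise deformable convolution, reduced to one dimension and written in plain Python. It is not the DeformUX-Net implementation: the function names (`linear_sample`, `deformable_depthwise_conv1d`) are hypothetical, the offsets are passed in by hand rather than predicted by a parallel branch, and linear interpolation stands in for the trilinear sampling a volumetric version would presumably need. It only illustrates the two ideas named above: each channel is filtered independently (depth-wise), and each kernel tap samples the input at a shifted, possibly fractional position (deformable), with the same offsets shared across channels.

```python
import math


def linear_sample(signal, pos):
    """Linearly interpolate `signal` at fractional index `pos` (zero outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        # Zero padding for out-of-range indices.
        return signal[i] if 0 <= i < len(signal) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def deformable_depthwise_conv1d(x, weights, offsets):
    """Toy 1D depth-wise deformable convolution (illustrative assumption only).

    x:       C channels, each a list of N floats
    weights: C kernels, each a list of K floats (K odd); one kernel per
             channel, so channels are never mixed (depth-wise)
    offsets: N x K offsets, one per output position and kernel tap,
             shared by all channels
    """
    out = []
    for channel, kernel in zip(x, weights):
        half = len(kernel) // 2
        row = []
        for n in range(len(channel)):
            acc = 0.0
            for t, w in enumerate(kernel):
                # Regular grid position plus the learned (here: given) offset.
                pos = n + (t - half) + offsets[n][t]
                acc += w * linear_sample(channel, pos)
            row.append(acc)
        out.append(row)
    return out


if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0, 4.0]]
    identity_kernel = [[0.0, 1.0, 0.0]]
    zero_offsets = [[0.0, 0.0, 0.0] for _ in range(4)]
    half_offsets = [[0.0, 0.5, 0.0] for _ in range(4)]
    # Zero offsets reduce to an ordinary depth-wise convolution.
    print(deformable_depthwise_conv1d(x, identity_kernel, zero_offsets))
    # A +0.5 offset on the centre tap samples halfway between neighbours.
    print(deformable_depthwise_conv1d(x, identity_kernel, half_offsets))
```

With all offsets set to zero the sketch behaves like a standard depth-wise convolution, which makes clear that the offsets are the only source of the adaptive sampling pattern.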