DeformUX-Net: Exploring a 3D Foundation Backbone for Medical Image Segmentation with Depthwise Deformable Convolution

The application of 3D vision transformers (ViTs) to medical image segmentation has made remarkable strides, somewhat overshadowing recent advances in convolutional neural network (CNN)-based models. Large-kernel depthwise convolution has emerged as a promising technique, showing capabilities akin to hierarchical transformers and providing the large effective receptive field (ERF) that dense prediction requires. Even so, existing core operators, ranging from global-local attention to large-kernel convolution, exhibit inherent trade-offs and limitations (e.g., global-local range trade-off, aggregating attentional features). We hypothesize that deformable convolution can serve as an exploratory alternative that combines the advantages of these operators, providing long-range dependency, adaptive spatial aggregation, and computational efficiency in a foundation backbone. In this work, we introduce 3D DeformUX-Net, a volumetric CNN model designed to address the shortcomings traditionally associated with ViTs and large-kernel convolution. Specifically, we revisit volumetric deformable convolution in a depth-wise setting to capture long-range dependency with computational efficiency. Inspired by structural re-parameterization of convolution kernel weights, we further generate the deformable tri-planar offsets with a parallel branch (starting from a $1\times1\times1$ convolution), providing adaptive spatial aggregation across all channels. Our empirical evaluations show that 3D DeformUX-Net consistently outperforms existing state-of-the-art ViTs and large-kernel convolution models in mean Dice across four challenging public datasets, with targets ranging in scale from organs (KiTS: 0.680 to 0.720, MSD Pancreas: 0.676 to 0.717, AMOS: 0.871 to 0.902) to vessels (MSD hepatic vessels: 0.635 to 0.671).
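
To make the core operator concrete, the following is a minimal sketch of a depth-wise deformable convolution, reduced to 1D and written in plain Python for readability. It is not the authors' implementation: the function names (`linear_sample`, `pointwise_offsets`, `depthwise_deformable_conv1d`), the zero padding, the linear interpolation (a stand-in for trilinear sampling in 3D), and the choice to share one offset per kernel tap across all channels are assumptions made for illustration. The offset branch here is a single pointwise (kernel size 1) convolution, loosely mirroring the parallel branch that starts from a $1\times1\times1$ convolution.

```python
import math


def linear_sample(signal, pos):
    """Linearly interpolate `signal` at fractional index `pos` (zero outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        # Zero padding: out-of-range indices contribute nothing.
        return signal[i] if 0 <= i < len(signal) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def pointwise_offsets(x, offset_weights):
    """Hypothetical offset branch: a kernel-size-1 convolution over channels.

    x:              C channels, each a list of N floats
    offset_weights: K rows (one per kernel tap), each with C weights
    returns:        offsets[p][t], the shift of tap t at output position p
    """
    n = len(x[0])
    return [
        [sum(w * x[c][p] for c, w in enumerate(row)) for row in offset_weights]
        for p in range(n)
    ]


def depthwise_deformable_conv1d(x, weights, offsets):
    """Depth-wise deformable convolution (cross-correlation form) in 1D.

    x:       C channels, each a list of N floats
    weights: C kernels of K floats (one kernel per channel, i.e. depth-wise)
    offsets: offsets[p][t], shared by all channels (an assumption of this sketch)
    """
    k = len(weights[0])
    half = k // 2
    out = []
    for c, channel in enumerate(x):
        row = []
        for p in range(len(channel)):
            acc = 0.0
            for t in range(k):
                # Regular grid location (p + t - half) shifted by the offset.
                pos = p + (t - half) + offsets[p][t]
                acc += weights[c][t] * linear_sample(channel, pos)
            row.append(acc)
        out.append(row)
    return out
```

With all offsets set to zero, the sketch reduces to an ordinary zero-padded depth-wise convolution; non-zero offsets move each tap's sampling location, which is how a small kernel can reach distant positions without growing its parameter count.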
