Inspired by vision transformers, depth-wise convolution has been revisited to provide a large Effective Receptive Field (ERF) using Large Kernel (LK) sizes for medical image segmentation. However, segmentation performance may saturate or even degrade as kernel sizes scale up (e.g., $21\times 21\times 21$) in a Convolutional Neural Network (CNN). We hypothesize that convolution with LK sizes is limited in its ability to maintain optimal convergence for locality learning. While Structural Re-parameterization (SR) enhances local convergence by adding small kernels in parallel, the optimal small-kernel branches may hinder computational efficiency during training. In this work, we propose RepUX-Net, a pure CNN architecture with a simple large kernel block design, which competes favorably with current state-of-the-art (SOTA) networks (e.g., 3D UX-Net, SwinUNETR) on 6 challenging public datasets. We derive an equivalency between kernel re-parameterization and branch-wise variation in kernel convergence. Inspired by spatial frequency in the human visual system, we extend the variation in kernel convergence to an element-wise setting and model the spatial frequency as a Bayesian prior to re-parameterize convolutional weights during training. Specifically, a reciprocal function is used to estimate a frequency-weighted value, which rescales the corresponding kernel element for stochastic gradient descent. In our experiments, RepUX-Net consistently outperforms 3D SOTA benchmarks in Dice score across internal validation (FLARE: 0.929 to 0.944), external validation (MSD: 0.901 to 0.932, KiTS: 0.815 to 0.847, LiTS: 0.933 to 0.949, TCIA: 0.736 to 0.779), and transfer learning (AMOS: 0.880 to 0.911) scenarios.
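The element-wise rescaling described above can be illustrated with a minimal sketch. The abstract does not give the exact reciprocal function, so the form `alpha / (alpha + d)` (with `d` the Euclidean distance of a kernel element from the kernel center), the parameter `alpha`, and the function names `frequency_scale` and `rescaled_sgd_step` are all assumptions made for illustration. Applying the factor to the gradient update is likewise one possible reading, not necessarily the authors' implementation.

```python
import math


def frequency_scale(kernel_size, alpha=1.0):
    """Return a k x k x k nested list of per-element scaling factors.

    Assumed form (not stated in the abstract): alpha / (alpha + d), where d is
    the Euclidean distance of an element from the kernel center. Elements near
    the center get factors close to 1; peripheral elements get smaller factors.
    """
    c = (kernel_size - 1) / 2.0
    rng = range(kernel_size)
    return [[[alpha / (alpha + math.sqrt((i - c) ** 2 + (j - c) ** 2 + (k - c) ** 2))
              for k in rng] for j in rng] for i in rng]


def rescaled_sgd_step(weights, grads, scale, lr=0.1):
    """One plain SGD step in which each element's update is multiplied by its factor."""
    return [[[w - lr * s * g for w, s, g in zip(w_row, s_row, g_row)]
             for w_row, s_row, g_row in zip(w_plane, s_plane, g_plane)]
            for w_plane, s_plane, g_plane in zip(weights, scale, grads)]


# Toy example: a 3x3x3 kernel of zeros with a uniform gradient of ones.
k = 3
scale = frequency_scale(k, alpha=1.0)
weights = [[[0.0] * k for _ in range(k)] for _ in range(k)]
grads = [[[1.0] * k for _ in range(k)] for _ in range(k)]
updated = rescaled_sgd_step(weights, grads, scale, lr=0.1)
```

In this toy setting the central element takes the full step, while corner elements move less, which mimics an element-wise variation in convergence across the kernel.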