Large kernel convolutions offer a scalable alternative to vision transformers for high-resolution 3D volumetric analysis, yet naively increasing kernel size often leads to optimization instability. Motivated by the spatial bias inherent in effective receptive fields (ERFs), we theoretically demonstrate that structurally re-parameterized blocks induce spatially varying learning rates that are crucial for convergence. Leveraging this insight, we introduce Rep3D, a framework that employs a lightweight modulation network to generate receptive-biased scaling masks, adaptively re-weighting kernel updates within a plain encoder architecture. This approach unifies spatial inductive bias with optimization-aware learning, avoiding the complexity of multi-branch designs while ensuring robust local-to-global convergence. Extensive evaluations on five 3D segmentation benchmarks demonstrate that Rep3D consistently outperforms state-of-the-art transformer and fixed-prior baselines. The source code is publicly available at https://github.com/leeh43/Rep3D.
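The idea of a receptive-biased scaling mask can be sketched as a spatially varying learning rate applied elementwise to a 3D kernel's gradient update. The sketch below is a minimal NumPy illustration, not the paper's implementation: it uses a fixed Gaussian prior as a placeholder for the learned lightweight modulation network, and the names `receptive_bias_mask` and `modulated_update` are hypothetical.

```python
import numpy as np

def receptive_bias_mask(kernel_size, sigma=None):
    # Hypothetical fixed-prior mask: scale is largest at the kernel
    # center and decays toward the periphery, mirroring the
    # Gaussian-like shape of effective receptive fields (ERFs).
    # In Rep3D this mask would instead be produced by a learned
    # modulation network.
    if sigma is None:
        sigma = kernel_size / 4.0
    ax = np.arange(kernel_size) - (kernel_size - 1) / 2.0
    zz, yy, xx = np.meshgrid(ax, ax, ax, indexing="ij")
    mask = np.exp(-(xx**2 + yy**2 + zz**2) / (2.0 * sigma**2))
    return mask / mask.max()  # normalize so the center scale is 1

def modulated_update(weight, grad, lr, mask):
    # Re-weight the kernel update elementwise: an effective
    # spatially varying learning rate of lr * mask.
    return weight - lr * mask * grad

# Toy example: one 7x7x7 large kernel with a uniform gradient.
k = 7
w = np.zeros((k, k, k))
g = np.ones((k, k, k))
m = receptive_bias_mask(k)
w_new = modulated_update(w, g, lr=0.1, mask=m)
```

Under this sketch, central kernel weights receive larger updates than peripheral ones, which is the local-to-global convergence behavior the abstract attributes to spatially varying learning rates.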