Transformer-based architectures have advanced medical image analysis by effectively modeling long-range dependencies, yet they often struggle in 3D settings due to substantial memory overhead and insufficient capture of fine-grained local features. We address these limitations with WaveFormer, a novel 3D transformer that: i) leverages the fundamental frequency-domain properties of features for contextual representation, and ii) is inspired by the top-down mechanism of the human visual recognition system, making it a biologically motivated architecture. By employing the discrete wavelet transform (DWT) at multiple scales, WaveFormer preserves both global context and high-frequency details while replacing heavy upsampling layers with efficient wavelet-based summarization and reconstruction. This significantly reduces the parameter count, which is critical for real-world deployment where computational resources and training times are constrained. Furthermore, the model is generic and easily adaptable to diverse applications. Evaluations on BraTS2023, FLARE2021, and KiTS2023 demonstrate performance on par with state-of-the-art methods while offering substantially lower computational complexity.
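The abstract does not specify the wavelet family or implementation used in WaveFormer. As a minimal sketch of the underlying idea, the following NumPy-only example (all function names are illustrative, not from the paper) applies a one-level 3D Haar DWT to a feature volume: the low-frequency `aaa` band summarizes global context at half the resolution in each dimension, the seven high-frequency bands retain fine detail, and the inverse transform reconstructs the volume exactly, which is what allows wavelet reconstruction to stand in for learned upsampling.

```python
import numpy as np

def haar_split(x, axis):
    """One-level Haar DWT along one axis: low-pass (average) and high-pass (difference) bands."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_merge(low, high, axis):
    """Inverse of haar_split: recover even/odd samples and re-interleave them."""
    even = (low + high) / np.sqrt(2)
    odd = (low - high) / np.sqrt(2)
    shape = list(even.shape)
    shape[axis] *= 2
    out = np.empty(shape, dtype=even.dtype)
    sl = [slice(None)] * out.ndim
    sl[axis] = slice(0, None, 2)
    out[tuple(sl)] = even
    sl[axis] = slice(1, None, 2)
    out[tuple(sl)] = odd
    return out

def dwt3(vol):
    """One-level 3D Haar DWT: returns 8 sub-bands keyed 'aaa'..'ddd'
    (a = low-pass, d = high-pass, one letter per axis)."""
    bands = {"": vol}
    for axis in range(3):
        bands = {key + tag: band
                 for key, arr in bands.items()
                 for tag, band in zip("ad", haar_split(arr, axis))}
    return bands

def idwt3(bands):
    """Exact inverse of dwt3 (Haar is orthonormal, so reconstruction is lossless)."""
    for axis in reversed(range(3)):
        bands = {key: haar_merge(bands[key + "a"], bands[key + "d"], axis)
                 for key in {k[:-1] for k in bands}}
    return bands[""]

rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8))  # stand-in for a 3D feature volume
bands = dwt3(volume)
recon = idwt3(bands)
print(bands["aaa"].shape)                # (4, 4, 4): half resolution per axis
print(np.allclose(recon, volume))        # True: lossless reconstruction
```

Because the approximation band is one eighth the size of the input, attention or summarization applied to it is far cheaper than on the full volume, while the stored detail bands let the decoder recover resolution without learned upsampling parameters.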