Transformer-based architectures have advanced medical image analysis by effectively modeling long-range dependencies, yet they often struggle in 3D settings due to substantial memory overhead and insufficient capture of fine-grained local features. We address these limitations with WaveFormer, a novel 3D transformer that: i) leverages the fundamental frequency-domain properties of features for contextual representation, and ii) is inspired by the top-down mechanism of the human visual recognition system, making it a biologically motivated architecture. By employing discrete wavelet transforms (DWT) at multiple scales, WaveFormer preserves both global context and high-frequency details while replacing heavy upsampling layers with efficient wavelet-based summarization and reconstruction. This significantly reduces the number of parameters, which is critical for real-world deployment where computational resources and training times are constrained. Furthermore, the model is generic and easily adaptable to diverse applications. Evaluations on BraTS2023, FLARE2021, and KiTS2023 demonstrate performance on par with state-of-the-art methods while offering substantially lower computational complexity.
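The wavelet summarization/reconstruction idea can be illustrated concretely. The sketch below (a minimal NumPy toy, not the authors' implementation; the function names and the single-axis Haar transform are illustrative assumptions) shows how a one-level DWT halves a spatial axis into a low-frequency "summary" band plus a high-frequency detail band, and how the inverse transform reconstructs the input exactly, which is why learned upsampling layers can be replaced by wavelet reconstruction:

```python
import numpy as np

def haar_split(x, axis):
    """One-level Haar DWT along `axis`: returns (low-pass, high-pass) halves."""
    even = np.take(x, range(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, range(1, x.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_merge(lo, hi, axis):
    """Inverse of haar_split: interleave the reconstructed even/odd samples."""
    even = (lo + hi) / np.sqrt(2)
    odd = (lo - hi) / np.sqrt(2)
    stacked = np.stack([even, odd], axis=axis + 1)  # pair each even with its odd
    shape = list(lo.shape)
    shape[axis] *= 2
    return stacked.reshape(shape)  # C-order reshape interleaves along `axis`

vol = np.random.rand(32, 32, 32)      # toy 3D feature volume
lo, hi = haar_split(vol, axis=0)      # low-frequency summary at half resolution
rec = haar_merge(lo, hi, axis=0)      # lossless reconstruction

assert lo.shape == (16, 32, 32)
assert np.allclose(rec, vol)
```

Applying such splits along all three axes at successive scales yields progressively coarser summaries for global context while the retained detail bands preserve high-frequency information for exact reconstruction.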