Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave-based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency, from low-frequency global layout to high-frequency edges and textures, is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed-form, frequency-time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time, well below the quadratic cost of attention. Building on WPO, we propose a family of WaveFormer models as drop-in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6x higher throughput and 30% fewer FLOPs than attention-based alternatives. Furthermore, our results demonstrate that wave propagation introduces a modeling bias complementary to heat-based methods, effectively capturing both the global coherence and the high-frequency details essential for rich visual semantics. Code is available at: https://github.com/ZishanShu/WaveFormer.
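To make the frequency-time decoupling concrete, the sketch below illustrates one plausible form of such an operator: the feature map is moved to the frequency domain via an FFT, each spatial frequency is scaled by the closed-form response of an underdamped wave equation, and the result is transformed back. This is a hypothetical illustration, not the authors' WPO implementation; the damping constant `gamma`, wave speed `c`, and the zero-initial-velocity assumption are ours.

```python
import numpy as np

def wave_propagation_sketch(x, t=1.0, c=1.0, gamma=0.1):
    """Illustrative frequency-domain wave propagation step (not the paper's WPO).

    Treats the 2-D feature map `x` as a spatial signal and evolves it for an
    internal time `t` under the underdamped wave equation
        u_tt + 2*gamma*u_t = c^2 * laplacian(u),
    solved independently per spatial frequency k via FFT, giving O(N log N) cost.
    """
    H, W = x.shape
    ky = np.fft.fftfreq(H) * 2 * np.pi
    kx = np.fft.fftfreq(W) * 2 * np.pi
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2  # squared magnitude |k|^2 per bin

    # Closed-form underdamped solution with zero initial velocity:
    #   u_hat(k, t) = e^{-gamma t} (cos(w_d t) + (gamma / w_d) sin(w_d t)) u_hat(k, 0)
    # where w_d = sqrt(c^2 |k|^2 - gamma^2); clamped for numerical stability at k ~ 0.
    wd = np.sqrt(np.maximum(c ** 2 * k2 - gamma ** 2, 1e-12))
    response = np.exp(-gamma * t) * (np.cos(wd * t) + (gamma / wd) * np.sin(wd * t))

    # Apply the per-frequency response and return to the spatial domain.
    return np.real(np.fft.ifft2(np.fft.fft2(x) * response))
```

Because the response depends on |k|, low frequencies (global layout) and high frequencies (edges, textures) decay and oscillate at different rates, which is the frequency-time interaction the abstract describes as explicitly controlled.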