Visual long-range interaction refers to modeling dependencies between distant feature points or blocks within an image, which can significantly enhance a model's robustness. Both CNNs and Transformers can establish long-range interactions through layer stacking and patch-based computation. However, the underlying mechanism of long-range interaction in visual space remains unclear. We propose mode-locking theory as this mechanism: it constrains the phase and wavelength relationships between waves to produce a mode-locked interference waveform. We verify the theory through simulation experiments and demonstrate the mode-locking pattern in real-world scene models. The proposed theory provides a comprehensive understanding of the mechanism behind long-range interaction in artificial neural networks, and it can inspire the integration of the mode-locking pattern into models to enhance their robustness.
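As a rough illustration of what constraining phase and wavelength relationships can mean, the sketch below superposes a set of cosine waves with equally spaced frequencies, once with all phases fixed to zero and once with random phases. It draws on the general wave-optics notion of mode locking and is not the simulation used in this work; the function names `superpose` and `peak_to_rms`, the mode count, and the sampling grid are all assumptions made for illustration.

```python
import numpy as np


def superpose(n_modes, phases, n_samples=2048):
    """Sum n_modes unit-amplitude cosines with integer-spaced frequencies.

    Hypothetical helper: the equal frequency spacing stands in for a fixed
    wavelength relationship, and `phases` sets the phase relationship.
    """
    x = np.linspace(0.0, 1.0, n_samples, endpoint=False)  # one spatial period
    k = np.arange(1, n_modes + 1)                         # mode indices 1..n_modes
    # Broadcast to shape (n_modes, n_samples), then sum over the modes.
    waves = np.cos(2.0 * np.pi * k[:, None] * x[None, :] + phases[:, None])
    return x, waves.sum(axis=0)


def peak_to_rms(signal):
    """Ratio of the largest absolute value to the root-mean-square value."""
    return np.max(np.abs(signal)) / np.sqrt(np.mean(signal ** 2))


n_modes = 32                       # assumed value, chosen only for the demo
rng = np.random.default_rng(0)     # fixed seed so the run is repeatable

# Locked case: every mode shares the same phase, so the peaks line up.
x, locked = superpose(n_modes, np.zeros(n_modes))

# Unlocked case: phases drawn at random, so the modes interfere incoherently.
_, unlocked = superpose(n_modes, rng.uniform(0.0, 2.0 * np.pi, n_modes))

print("locked   peak/RMS:", peak_to_rms(locked))
print("unlocked peak/RMS:", peak_to_rms(unlocked))
```

Both waveforms carry the same energy, since the phases do not affect the RMS value here. With locked phases the modes add constructively at a single position, giving a sharp pulse whose peak-to-RMS ratio is the square root of twice the mode count. With random phases the energy is spread across the period and the ratio is lower.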