Convolutional Neural Networks (CNNs) rely on fixed-size kernels scanning local patches, which limits their ability to capture global context or long-range dependencies without very deep architectures. Vision Transformers (ViTs), by contrast, provide global connectivity but lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. Addressing these limitations requires a representation that is both structured and global. We introduce SONIC (Spectral Oriented Neural Invariant Convolutions), a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components. These components define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions. Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters. These results demonstrate that continuous, orientation-aware spectral parameterisations provide a principled and scalable alternative to conventional spatial and spectral operators.
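The core idea of a continuous, orientation-selective spectral parameterisation can be illustrated with a minimal sketch. The exact parameterisation used by SONIC is not given in the abstract; the component form below (separable Gaussian radial bands with cosine orientation tuning) and all function names are illustrative assumptions. The sketch shows the two properties the abstract highlights: multiplication in the frequency domain gives a global receptive field, and evaluating the same smooth components on a different frequency grid yields a resolution-adapted filter.

```python
# Hypothetical sketch of a continuous, orientation-selective spectral filter.
# Component form (Gaussian radial band x cosine orientation tuning) is an
# assumption for illustration, not the paper's actual parameterisation.
import numpy as np

def frequency_grid(h, w):
    """Polar coordinates (radius, orientation) of the 2D frequency grid."""
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies, shape (h, 1)
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies, shape (1, w)
    radius = np.sqrt(fx**2 + fy**2)   # radial frequency
    theta = np.arctan2(fy, fx)        # orientation angle
    return radius, theta

def spectral_filter(h, w, weights, orientations, centers, bandwidth=0.1):
    """Sum of smooth, shared components defined over the full spectrum.

    Because the components are continuous functions of frequency, the same
    parameters can be evaluated on any (h, w) grid.
    """
    radius, theta = frequency_grid(h, w)
    response = np.zeros((h, w))
    for wgt, ori, ctr in zip(weights, orientations, centers):
        ang = np.cos(theta - ori) ** 2                              # orientation selectivity
        rad = np.exp(-((radius - ctr) ** 2) / (2 * bandwidth**2))   # radial band
        response += wgt * ang * rad
    return response

def apply_filter(image, response):
    """Pointwise multiplication in frequency space = global receptive field."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * response))

# Same learned parameters, evaluated at two resolutions.
params = dict(weights=[1.0, 0.5], orientations=[0.0, np.pi / 2],
              centers=[0.1, 0.25])
img = np.random.default_rng(0).standard_normal((32, 32))
out = apply_filter(img, spectral_filter(32, 32, **params))
resp_hi = spectral_filter(64, 64, **params)   # resolution-adapted filter
```

In a learned setting, `weights`, `orientations`, and `centers` would be trainable parameters shared across operators, which is what keeps the parameter count small relative to storing a discrete kernel per filter.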