The self-attention mechanism enables long-sequence modeling by using large implicit weight matrices, which are programmed through dot-product-based activations and require very few trainable parameters. In this paper, we investigate whether residual learning can be discarded by employing large implicit kernels to achieve full context interaction at each layer of the network. To this end, we introduce coordinate-based implicit MLPs as a slow network that generates hyper-kernels for a fast convolutional network. To obtain context-varying weights for fast dynamic encoding, we propose a $\mathrm{Hyper}\mathcal{Z{\cdot}Z{\cdot}W}$ operator that connects hyper-kernels ($\mathcal{W}$) and hidden activations ($\mathcal{Z}$) through simple elementwise multiplication, followed by convolution of $\mathcal{Z}$ using the context-dependent $\mathcal{W}$. Based on this design, we present a novel Terminator architecture that integrates hyper-kernels of different sizes to produce multi-branch hidden representations, enhancing the feature extraction capability of each layer. A bottleneck layer then compresses the concatenated channels, allowing only valuable information to propagate to subsequent layers. The model also introduces a local feedback error for updating the slow network and exhibits stable zero-mean features, faster training convergence, and fewer model parameters. Extensive experiments on pixel-level 1D and 2D image classification benchmarks demonstrate the superior performance of our architecture.
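To make the data flow of the $\mathrm{Hyper}\mathcal{Z{\cdot}Z{\cdot}W}$ operator concrete, the following is a minimal single-channel 1D sketch in plain Python, not the authors' implementation. It rests on several assumptions that the abstract does not state: the hyper-kernel is sampled at the same length as the input, the slow network is a one-hidden-layer MLP with a sine nonlinearity and fixed random weights, and the convolution is zero-padded to keep the output length. The helper names (`make_slow_mlp`, `hyper_kernel`, `conv1d_same`, `hyper_zzw`) are hypothetical.

```python
import math
import random


def make_slow_mlp(hidden=8, seed=0):
    """Build a tiny coordinate MLP: scalar position -> scalar kernel value."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    w2 = [rng.uniform(-1, 1) / hidden for _ in range(hidden)]

    def mlp(x):
        # One hidden layer with a sine nonlinearity (an assumed choice).
        return sum(w2[j] * math.sin(w1[j] * x + b1[j]) for j in range(hidden))

    return mlp


def hyper_kernel(mlp, size):
    """Sample the slow network on `size` evenly spaced coordinates in [-1, 1]."""
    if size == 1:
        return [mlp(0.0)]
    return [mlp(-1.0 + 2.0 * i / (size - 1)) for i in range(size)]


def conv1d_same(z, kernel):
    """Zero-padded cross-correlation; the output has the same length as z."""
    n, k = len(z), len(kernel)
    half = k // 2
    out = []
    for i in range(n):
        acc = 0.0
        for j in range(k):
            src = i + j - half
            if 0 <= src < n:  # positions outside the input count as zero
                acc += z[src] * kernel[j]
        out.append(acc)
    return out


def hyper_zzw(z, mlp):
    """Sketch of the operator: W, then Z * W, then conv(Z, Z * W)."""
    w = hyper_kernel(mlp, len(z))           # hyper-kernel W, as long as the input (assumed)
    zw = [zi * wi for zi, wi in zip(z, w)]  # context-dependent weights via elementwise product
    return conv1d_same(z, zw)               # convolve Z with those weights


z = [0.5, -1.0, 2.0, 0.0, 1.5]
y = hyper_zzw(z, make_slow_mlp())
```

In this sketch the only parameters belong to the slow MLP, and the output is quadratic in $\mathcal{Z}$: doubling the input quadruples the output, because $\mathcal{Z}$ enters once through the weights and once as the signal being convolved.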