HyperZ$\cdot$Z$\cdot$W Operator Connects Slow-Fast Networks for Full Context Interaction

The self-attention mechanism utilizes large implicit weight matrices, programmed through dot product-based activations with very few trainable parameters, to enable long sequence modeling. In this paper, we investigate the possibility of discarding residual learning by employing large implicit kernels to achieve full context interaction at each layer of the network. To accomplish this, we introduce coordinate-based implicit MLPs as a slow network that generates hyper-kernels for a separate fast convolutional network. To obtain context-varying weights for fast dynamic encoding, we propose a $\mathrm{Hyper}\mathcal{Z{\cdot}Z{\cdot}W}$ operator that connects hyper-kernels ($\mathcal{W}$) and hidden activations ($\mathcal{Z}$) through simple elementwise multiplication, followed by convolution of $\mathcal{Z}$ using the context-dependent $\mathcal{W}$. Based on this design, we present a novel Terminator architecture that integrates hyper-kernels of different sizes to produce multi-branch hidden representations, enhancing the feature extraction capability of each layer. Additionally, a bottleneck layer compresses the concatenated channels, allowing only valuable information to propagate to subsequent layers. Notably, our model incorporates several innovative components and exhibits excellent properties, including a local feedback error for updating the slow network, stable zero-mean features, faster training convergence, and fewer model parameters. Extensive experimental results on pixel-level 1D and 2D image classification benchmarks demonstrate the superior performance of our architecture.
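The operator described above can be illustrated with a minimal single-channel 1D sketch in NumPy. This is not the authors' implementation; it only mirrors the two steps named in the abstract (elementwise product of $\mathcal{W}$ and $\mathcal{Z}$, then convolution of $\mathcal{Z}$ with the resulting context-dependent kernel). The function names `implicit_mlp_kernel` and `hyper_zzw`, the sine activation, the random untrained weights, the kernel length equal to the sequence length, and the `"same"` padding mode are all assumptions made for illustration.

```python
import numpy as np

def implicit_mlp_kernel(length, hidden=16, seed=0):
    """Slow network (hypothetical): a tiny coordinate-based MLP that maps
    positions in [-1, 1] to kernel values, yielding a hyper-kernel W."""
    rng = np.random.default_rng(seed)
    coords = np.linspace(-1.0, 1.0, length).reshape(-1, 1)  # shape (L, 1)
    w1 = rng.normal(size=(1, hidden))
    b1 = rng.normal(size=(hidden,))
    w2 = rng.normal(size=(hidden, 1)) / np.sqrt(hidden)
    h = np.sin(coords @ w1 + b1)  # sine activation is an assumption
    return (h @ w2).ravel()       # shape (L,)

def hyper_zzw(z, w):
    """Sketch of the Z.Z.W idea for one channel:
    1) make the kernel context-dependent via an elementwise product W * Z,
    2) convolve Z with that context-dependent kernel."""
    assert z.shape == w.shape, "this sketch assumes a kernel as long as the input"
    w_ctx = w * z                               # context-dependent kernel
    return np.convolve(z, w_ctx, mode="same")   # output keeps the input length

# Toy usage: random hidden activations and an untrained hyper-kernel.
z = np.random.default_rng(1).normal(size=32)
w = implicit_mlp_kernel(32)
out = hyper_zzw(z, w)
print(out.shape)
```

Because the kernel itself depends on $\mathcal{Z}$, the output of this sketch is quadratic in the input (doubling `z` quadruples `out`), in contrast to an ordinary convolution with fixed weights, which is linear.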

