A core component present in many successful neural network architectures is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that, after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, in which individual neurons or weights can be deleted from the network, this form of {\em dynamic} activation sparsity appears to be harder to exploit to obtain more efficient networks. Motivated by this, we initiate a formal study of the PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of {\em sparsely activated} networks would lead to methods that can exploit activation sparsity in practice.
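As a concrete reference, a minimal formalization of the objects discussed above, in our own notation (the paper's symbols may differ): the MLP block computes
\[
  f(x) \;=\; W_2\,\sigma(W_1 x + b_1) + b_2,
\]
where $\sigma$ is the non-linear activation applied entrywise, and the hidden activations are $h(x) = \sigma(W_1 x + b_1) \in \mathbb{R}^m$. Traditional (static) sparsity would mean some coordinates of $h$ are identically zero across all inputs, so the corresponding neurons can be pruned; the {\em dynamic} activation sparsity above instead means that for each individual input $x$ one has $\|h(x)\|_0 \le k$ for some $k \ll m$, while the support of $h(x)$ may vary from input to input, which is why no single neuron can simply be deleted.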