The multi-layer perceptron (MLP) is a fundamental component of deep learning, and recent MLP-based architectures, especially the MLP-Mixer, have achieved significant empirical success. Nevertheless, why and how the MLP-Mixer outperforms conventional MLPs remains largely unexplored. In this work, we reveal that sparseness is a key mechanism underlying the MLP-Mixer. First, the Mixer has an effective expression as a wider MLP with Kronecker-product weights, clarifying that the Mixer efficiently embodies several sparseness properties explored in deep learning. For linear layers, this effective expression elucidates an implicit sparse regularization caused by the model architecture and a hidden relation to Monarch matrices, another known form of sparse parameterization. Next, for general cases, we empirically demonstrate quantitative similarities between the Mixer and unstructured sparse-weight MLPs. Following the guiding principle proposed by Golubeva, Neyshabur and Gur-Ari (2021), which fixes the number of connections while increasing width and sparsity, the Mixer can achieve improved performance.
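To make the effective-expression claim concrete, here is a minimal sketch of the linear case: a Mixer block that mixes tokens with a weight matrix and channels with another is equivalent to a single wide layer whose weight is a Kronecker product acting on the flattened input, via the standard identity vec(AXB) = (Bᵀ ⊗ A) vec(X). The variable names (`W_tok`, `W_ch`) are illustrative, activations are omitted, and the paper's full statement covers the general nonlinear case.

```python
import numpy as np

rng = np.random.default_rng(0)
S, C = 4, 3                            # number of tokens, number of channels
X = rng.standard_normal((S, C))        # input: one token per row
W_tok = rng.standard_normal((S, S))    # token-mixing weights (illustrative)
W_ch = rng.standard_normal((C, C))     # channel-mixing weights (illustrative)

# Mixer-style computation: mix along the token axis, then the channel axis.
Y = W_tok @ X @ W_ch.T

# Equivalent wide-MLP computation: a single Kronecker-product weight applied
# to the column-major flattening of X, using vec(A X B) = (B.T kron A) vec(X).
vec = lambda M: M.flatten(order="F")   # column-major vectorization
y_flat = np.kron(W_ch, W_tok) @ vec(X)

assert np.allclose(vec(Y), y_flat)
print("Kronecker-product weight reproduces the Mixer computation.")
```

Note that the Kronecker-product weight has S·C × S·C entries but only S² + C² free parameters, which is one way the Mixer architecture implicitly realizes a sparse (structured) parameterization of a much wider MLP.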