The multi-layer perceptron (MLP) is a fundamental component of deep learning that has been applied extensively to a wide range of problems. However, the recent empirical success of MLP-based architectures, particularly the MLP-Mixer, shows that MLPs still have untapped potential for better performance. In this study, we reveal that the MLP-Mixer works effectively as a wide MLP with certain sparse weights. First, we clarify that the mixing layer of the Mixer has an effective expression as a wider MLP whose weights are sparse and represented by the Kronecker product. This expression naturally defines a permuted-Kronecker (PK) family, which can be regarded both as a general class of mixing layers and as an approximation of Monarch matrices. Second, because the PK family effectively constitutes a wide MLP with sparse weights, one can apply the hypothesis proposed by Golubeva, Neyshabur and Gur-Ari (2021) that, when the number of weights is fixed, prediction performance improves as the width (and hence the sparsity) increases. We empirically verify this hypothesis by maximizing the effective width of the MLP-Mixer, which enables us to determine the appropriate size of the mixing layers quantitatively.
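The claim that a mixing layer acts as a wider MLP with sparse, Kronecker-structured weights can be checked numerically on a toy example. The sketch below is an illustration and not the paper's own notation or code: it assumes a feature matrix `X` with `S` tokens and `C` channels, a token-mixing product `W X`, a channel-mixing product `X V`, column-major vectorization, and the standard identity vec(AXB) = (Bᵀ ⊗ A) vec(X). All names and sizes are hypothetical, and the helpers are written in plain Python to keep the example self-contained.

```python
import random

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def kron(A, B):
    # Kronecker product: block (i, j) of the result is A[i][j] * B.
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def vec(X):
    # Column-major vectorization: stack the columns of X into one vector.
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

def rand_matrix(rows, cols, rng):
    # Integer entries so that equality checks below are exact.
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
S, C = 4, 3                      # hypothetical sizes: S tokens, C channels
X = rand_matrix(S, C, rng)       # input features, shape (S, C)
W = rand_matrix(S, S, rng)       # token-mixing weights (assumed square)
V = rand_matrix(C, C, rng)       # channel-mixing weights (assumed square)

# Wide-MLP weights acting on the flattened input of length S * C.
token_wide = kron(identity(C), W)               # vec(W X) = (I_C (x) W) vec(X)
channel_wide = kron(transpose(V), identity(S))  # vec(X V) = (V^T (x) I_S) vec(X)

token_ok = matvec(token_wide, vec(X)) == vec(matmul(W, X))
channel_ok = matvec(channel_wide, vec(X)) == vec(matmul(X, V))

# The wide token-mixing matrix is block diagonal: at most S*S*C of its
# (S*C)**2 entries can be nonzero, i.e. a fraction of 1/C.
structural_nonzeros = S * S * C
total_entries = (S * C) ** 2
print(token_ok, channel_ok, structural_nonzeros, total_entries)
```

Under these assumptions, each mixing step is an ordinary linear layer of width `S * C` whose weight matrix is mostly zeros, which is the sense in which the Mixer can be read as a wide, sparse MLP.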