Transformer-based architectures are the model of choice for natural language understanding, but they come at a significant cost, as they have quadratic complexity in the input length, require a lot of training data, and can be difficult to tune. In the pursuit of lower costs, we investigate simple MLP-based architectures. We find that existing architectures such as MLPMixer, which achieves token mixing through a static MLP applied to each feature independently, are too detached from the inductive biases required for natural language understanding. In this paper, we propose a simple variant, HyperMixer, which forms the token mixing MLP dynamically using hypernetworks. Empirically, we demonstrate that our model performs better than alternative MLP-based models, and on par with Transformers. In contrast to Transformers, HyperMixer achieves these results at substantially lower costs in terms of processing time, training data, and hyperparameter tuning.
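To make the contrast between static and dynamic token mixing concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a token-mixing MLP of the form W2 · GELU(W1ᵀ X) applied to each feature column. The single linear hypernetworks `h1` and `h2` are simplifying assumptions, as are the names `StaticTokenMixer`, `HyperTokenMixer`, `token_mix`, and the dimensions. Details that a full model would need, such as positional information, are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def gelu(a):
    """Elementwise tanh approximation of GELU."""
    c = math.sqrt(2.0 / math.pi)
    return [[0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3))) for v in row]
            for row in a]


def rand_matrix(rows, cols, rng):
    """Small random matrix, used here in place of trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def token_mix(x, w1, w2):
    """Mix information across tokens, independently for each feature.

    x: (n_tokens, d_model); w1 and w2: (n_tokens, d_hidden).
    Returns w2 @ gelu(w1^T @ x), which has shape (n_tokens, d_model).
    """
    return matmul(w2, gelu(matmul(transpose(w1), x)))


class StaticTokenMixer:
    """Static mixing: the weights are fixed parameters tied to one input length."""

    def __init__(self, n_tokens, d_hidden, rng):
        self.w1 = rand_matrix(n_tokens, d_hidden, rng)
        self.w2 = rand_matrix(n_tokens, d_hidden, rng)

    def __call__(self, x):
        if len(x) != len(self.w1):
            raise ValueError("static mixer only supports its fixed input length")
        return token_mix(x, self.w1, self.w2)


class HyperTokenMixer:
    """Dynamic mixing: the weights are generated from the input tokens."""

    def __init__(self, d_model, d_hidden, rng):
        # Assumption: each hypernetwork is a single linear map (hypothetical).
        self.h1 = rand_matrix(d_model, d_hidden, rng)
        self.h2 = rand_matrix(d_model, d_hidden, rng)

    def __call__(self, x):
        w1 = matmul(x, self.h1)  # (n_tokens, d_hidden): one row per token
        w2 = matmul(x, self.h2)
        return token_mix(x, w1, w2)


rng = random.Random(0)
hyper = HyperTokenMixer(d_model=4, d_hidden=3, rng=rng)
short_input = rand_matrix(5, 4, rng)  # 5 tokens
long_input = rand_matrix(9, 4, rng)   # 9 tokens, same mixer
print(len(hyper(short_input)), len(hyper(long_input)))
```

In this sketch the static mixer is bound to the input length it was created for, whereas the hypernetwork variant builds one weight row per token and therefore accepts inputs of any length. Every matrix product in `HyperTokenMixer` also scales linearly with the number of tokens, in contrast to the quadratic cost the abstract attributes to Transformers.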