Transformer-based models have delivered impressive results on many tasks, particularly in vision and language. In practice, such models are typically trained with conventional configurations. For example, the base model is often given a hidden dimension (i.e., model width) of 768 and 12 transformer layers (i.e., model depth). In this paper, we revisit these conventional configurations. Through theoretical analysis and experimental evaluation, we show that masked autoencoder training is effective in alleviating the over-smoothing issue in deep transformer training. Based on this finding, we propose Bamboo, the idea of using deeper and narrower transformer configurations for masked autoencoder training. On ImageNet, with this simple change in configuration, the redesigned model achieves 87.1% top-1 accuracy and outperforms SoTA models such as MAE and BEiT. On language tasks, the redesigned model outperforms BERT with its default setting by 1.1 points on average on the GLUE datasets.
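To make the width/depth trade-off concrete, the following is a minimal sketch that estimates the encoder weight count of the conventional base configuration (width 768, depth 12) and of one deeper and narrower alternative. The function names, the MLP expansion ratio of 4, and the depth-48, width-384 configuration are illustrative assumptions, not the configurations reported in the paper. The estimate counts only the attention and MLP weight matrices and ignores biases, layer normalization, and embeddings.

```python
def block_params(width: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one transformer block.

    Assumption: standard block with Q, K, V, and output projections
    plus a two-layer MLP; biases and LayerNorm are ignored.
    """
    attention = 4 * width * width        # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * width * width  # width -> mlp_ratio*width -> width
    return attention + mlp


def encoder_params(depth: int, width: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of a stack of `depth` identical blocks."""
    return depth * block_params(width, mlp_ratio)


# Conventional base configuration mentioned in the abstract.
baseline = encoder_params(depth=12, width=768)

# Hypothetical deeper and narrower configuration (illustrative only).
deep_narrow = encoder_params(depth=48, width=384)

print(f"baseline    (depth=12, width=768): {baseline:,}")
print(f"deep-narrow (depth=48, width=384): {deep_narrow:,}")
```

Because the per-block count grows quadratically with width, halving the width while quadrupling the depth leaves this estimate unchanged, which is why a deeper and narrower model can be compared with the baseline at a similar parameter budget.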