Activation functions can significantly reduce the topological complexity of input data and thereby improve model performance. Selecting a suitable activation function is an essential step in neural model design. However, the choice of activation function is seldom discussed or explored in Transformer-based language models. Their activation functions are chosen beforehand and then remain fixed from pre-training to fine-tuning. As a result, the inductive biases they impose on models cannot be adjusted during this long life cycle. Moreover, subsequently developed models (e.g., RoBERTa, BART, and GPT-3) often follow prior work (e.g., BERT) and use the same activation function without justification. In this paper, we investigate the effectiveness of using the Rational Activation Function (RAF), a learnable activation function, in the Transformer architecture. In contrast to conventional, predefined activation functions, RAFs can adaptively learn optimal activation functions during training according to the input data. Our experiments show that the RAF-based Transformer (RAFT) achieves a lower validation perplexity than a vanilla BERT with the GELU function. We further evaluate RAFT on downstream tasks in low- and full-data settings. Our results show that RAFT outperforms the counterpart model across the majority of tasks and settings. For instance, RAFT outperforms vanilla BERT on the GLUE benchmark by 5.71 points on average in the low-data scenario (where 100 training examples are available) and by 2.05 points on SQuAD in the full-data setting. Analysis of the shapes of the learned RAFs further reveals that they vary substantially between different layers of the pre-trained model and mostly look very different from conventional activation functions. RAFT opens a new research direction for analyzing and interpreting pre-trained models according to the learned activation functions.
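To make the idea of a learnable rational activation concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the commonly used "safe" rational form, a polynomial P(x) divided by 1 + |Q(x)|, so that the denominator can never reach zero. The function names `polyval` and `rational_activation`, and the coefficient values in the usage note, are hypothetical and chosen only for illustration. In an actual model the coefficients `a` and `b` would be trainable parameters updated by gradient descent along with the rest of the network.

```python
from typing import Sequence


def polyval(coeffs: Sequence[float], x: float) -> float:
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ... via Horner's method."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def rational_activation(x: float, a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical 'safe' rational activation: P(x) / (1 + |Q(x)|).

    a: numerator coefficients a_0 .. a_m (assumed learnable)
    b: denominator coefficients b_1 .. b_n (assumed learnable, no constant term)
    """
    numerator = polyval(a, x)
    # Q(x) = b_1*x + ... + b_n*x**n; a leading 0.0 stands in for the absent constant term.
    q = polyval([0.0, *b], x)
    # The denominator is always >= 1, so the division is well defined for every x.
    return numerator / (1.0 + abs(q))
```

With `a = [0.0, 1.0]` and an empty `b`, the function reduces to the identity; other coefficient settings yield different shapes, which is what allows each layer to end up with its own activation after training.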