Despite the widespread success of Transformers on NLP tasks, recent work has found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions, which have low sensitivity, Transformers generalize near perfectly even in the presence of noisy labels, whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong, quantifiable evidence of differences in the inductive biases of Transformers and recurrent models, which may help explain Transformers' effective generalization performance despite their relatively limited expressiveness.
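The abstract relies on the notion of sensitivity without defining it. As an illustration only, the sketch below uses the standard definition (the sensitivity of f at an input x is the number of single-bit flips of x that change f(x), and the average sensitivity is its mean over all inputs); the exact metric used in the study may differ. The function names and the choice of n = 6 are assumptions made for this example, not taken from the paper.

```python
from itertools import product


def average_sensitivity(f, n):
    """Mean, over all 2^n inputs, of the number of single-bit flips that change f.

    Brute-force enumeration, so only practical for small n.
    """
    total = 0
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            # Flip the i-th bit of x and check whether the output changes.
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(flipped) != fx:
                total += 1
    return total / 2 ** n


def parity(x):
    # Depends on every bit: flipping any single bit changes the output.
    return sum(x) % 2


def sparse_parity(x):
    # Hypothetical sparse function: depends only on the first two bits.
    return (x[0] + x[1]) % 2


n = 6  # illustrative input length (an assumption)
full = average_sensitivity(parity, n)
sparse = average_sensitivity(sparse_parity, n)
print(f"parity: {full}, sparse parity: {sparse}")
```

Under this definition, full parity attains the maximum possible average sensitivity for its input length, while a function that depends on only k of the n bits has average sensitivity at most k. This is the sense in which sparse Boolean functions are low-sensitivity targets.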