Strong inductive biases enable learning from little data and help generalization outside of the training distribution. Popular neural architectures such as Transformers, on their own, lack strong structural inductive biases for seq2seq NLP tasks. Consequently, they struggle with systematic generalization beyond the training distribution, e.g., extrapolating to longer inputs, even when pre-trained on large amounts of text. We show how a structural inductive bias can be efficiently injected into a seq2seq model by pre-training it to simulate structural transformations on synthetic data. Specifically, we inject an inductive bias towards Finite State Transducers (FSTs) into a Transformer by pre-training it to simulate FSTs given their descriptions. Our experiments show that our method imparts the desired inductive bias, resulting in improved systematic generalization and better few-shot learning on FST-like tasks. Our analysis shows that fine-tuned models accurately capture the state dynamics of the unseen underlying FSTs, suggesting that the simulation process is internalized by the fine-tuned model.
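To make the central notion concrete, the following is a minimal, illustrative sketch of what it means to simulate a deterministic FST: a transition table maps a (state, input symbol) pair to a next state and an output symbol, and running the machine over an input string yields an output string. The transition-table encoding and the example transducer here are our own assumptions for illustration, not the paper's actual pre-training format.

```python
# Illustrative sketch of deterministic FST simulation (not the paper's
# implementation). transitions: dict mapping (state, in_sym) -> (next_state, out_sym).
def simulate_fst(transitions, start_state, final_states, inp):
    state, out = start_state, []
    for sym in inp:
        if (state, sym) not in transitions:
            return None  # no transition defined: input rejected
        state, o = transitions[(state, sym)]
        out.append(o)
    # Accept only if the run ends in a final state.
    return "".join(out) if state in final_states else None

# Hypothetical example FST: uppercase the first letter of each word
# over the alphabet {a, b, c, ' '}. State 0 = at word start, state 1 = inside word.
trans = {(0, c): (1, c.upper()) for c in "abc"}
trans.update({(1, c): (1, c) for c in "abc"})
trans.update({(0, " "): (0, " "), (1, " "): (0, " ")})

print(simulate_fst(trans, 0, {0, 1}, "ab ca"))  # -> "Ab Ca"
```

In the paper's setting, a serialized description of such a transition table and an input string would be fed to the Transformer during pre-training, with the FST's output string as the target.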