Although Transformers perform well on NLP tasks, recent studies suggest that self-attention is theoretically limited in its ability to learn even some regular and context-free languages. These findings motivated us to consider their implications for modeling natural language, which is hypothesized to be mildly context-sensitive. We test the ability of Transformers to learn mildly context-sensitive languages of varying complexity and find that they generalize well to unseen in-distribution data but extrapolate to longer strings less well than LSTMs. Our analyses show that the learned self-attention patterns and representations model dependency relations and exhibit counting behavior, which may have helped the models learn these languages.