While the successes of transformers across many domains are indisputable, an accurate understanding of their learning mechanics is still largely lacking. Their capabilities have been probed on benchmarks that include a variety of structured and reasoning tasks, but mathematical understanding lags substantially behind. Recent lines of work have begun studying representational aspects of this question: that is, the size, depth, and complexity of attention-based networks needed to perform certain tasks. However, there is no guarantee that the learning dynamics will converge to the proposed constructions. In this paper, we provide a fine-grained mechanistic understanding of how transformers learn "semantic structure", understood as capturing the co-occurrence structure of words. Precisely, we show, through a combination of mathematical analysis and experiments on Wikipedia data and on synthetic data modeled by Latent Dirichlet Allocation (LDA), that the embedding layer and the self-attention layer encode the topical structure. In the former case, this manifests as a higher average inner product between the embeddings of same-topic words. In the latter, it manifests as higher average pairwise attention between same-topic words. The mathematical results involve several assumptions, made to keep the analysis tractable and verified on data, and might be of independent interest as well.
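As an illustration of the two statistics named above, the following is a minimal sketch of how one might compare average same-topic and different-topic scores. It is not the paper's evaluation code. The helper name `same_vs_different_topic_mean`, the toy embeddings, and the simplification that each word belongs to exactly one topic are all assumptions made for this example.

```python
import numpy as np


def same_vs_different_topic_mean(scores, word_topics):
    """Average a pairwise score matrix over same-topic and different-topic word pairs.

    scores: (V, V) array, e.g. embedding inner products or attention weights.
    word_topics: length-V array of topic ids (assumes one topic per word).
    Self-pairs (i, i) are excluded from the same-topic average.
    """
    scores = np.asarray(scores, dtype=float)
    word_topics = np.asarray(word_topics)
    same = word_topics[:, None] == word_topics[None, :]
    off_diagonal = ~np.eye(len(word_topics), dtype=bool)
    same_mean = scores[same & off_diagonal].mean()
    diff_mean = scores[~same].mean()
    return same_mean, diff_mean


# Toy data (hypothetical): each topic has its own direction, and every word
# embedding is its topic direction plus a small amount of noise.
rng = np.random.default_rng(0)
num_topics, words_per_topic, dim = 3, 4, 8
word_topics = np.repeat(np.arange(num_topics), words_per_topic)
topic_directions = np.eye(num_topics, dim)  # orthonormal rows
noise = 0.01 * rng.normal(size=(len(word_topics), dim))
embeddings = topic_directions[word_topics] + noise

# Pairwise inner products between all word embeddings.
gram = embeddings @ embeddings.T
same_mean, diff_mean = same_vs_different_topic_mean(gram, word_topics)
print(f"same-topic mean: {same_mean:.3f}, different-topic mean: {diff_mean:.3f}")
```

Under the same assumptions, passing a matrix of averaged pairwise attention weights in place of `gram` would give the corresponding attention statistic.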