While the successes of transformers across many domains are indisputable, an accurate understanding of their learning mechanics is still largely lacking. Their capabilities have been probed on benchmarks that include a variety of structured and reasoning tasks, but mathematical understanding lags substantially behind. Recent lines of work have begun studying representational aspects of this question: that is, the size, depth, and complexity that attention-based networks need in order to perform certain tasks. However, there is no guarantee that the learning dynamics will converge to the proposed constructions. In this paper, we provide a fine-grained mechanistic understanding of how transformers learn "semantic structure", understood as capturing the co-occurrence structure of words. Precisely, we show, through a combination of mathematical analysis and experiments on Wikipedia data and on synthetic data modeled by Latent Dirichlet Allocation (LDA), that the embedding layer and the self-attention layer both encode the topical structure. In the former case, this manifests as a higher average inner product between embeddings of same-topic words than between embeddings of different-topic words. In the latter, it manifests as higher average pairwise attention between same-topic words. The mathematical results rely on several assumptions that make the analysis tractable; we verify these assumptions on data, and they may be of independent interest as well.
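To make the embedding-layer statistic concrete, the following is a minimal sketch of how one might compare the average inner product of same-topic and different-topic word embeddings. It is an illustration only, not the paper's code: the function name `same_vs_diff_topic_inner_product`, the toy embeddings, the assumption that each word carries exactly one topic label, and the choice to exclude each word's inner product with itself are all assumptions made here.

```python
import numpy as np

def same_vs_diff_topic_inner_product(embeddings, topics):
    """Average inner product over same-topic and different-topic word pairs.

    embeddings: (vocab_size, dim) array, one row per word.
    topics: (vocab_size,) integer array with one topic label per word
            (assumption: every word belongs to exactly one topic).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    topics = np.asarray(topics)
    gram = embeddings @ embeddings.T               # all pairwise inner products
    same = topics[:, None] == topics[None, :]      # True where the two words share a topic
    off_diag = ~np.eye(len(topics), dtype=bool)    # assumption: skip a word paired with itself
    same_avg = gram[same & off_diag].mean()
    diff_avg = gram[~same].mean()
    return same_avg, diff_avg

# Hypothetical toy data: two topics with two words each.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
topics = np.array([0, 0, 1, 1])
same_avg, diff_avg = same_vs_diff_topic_inner_product(emb, topics)
print(same_avg, diff_avg)
```

On this toy input the same-topic average exceeds the different-topic average, which is the qualitative pattern the abstract describes for trained embeddings. The analogous attention-layer statistic could be computed the same way by replacing `gram` with a matrix of pairwise attention weights.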