Self-supervised word embedding algorithms such as word2vec provide a minimal setting for studying representation learning in language modeling. We examine the quartic Taylor approximation of the word2vec loss around the origin, and we show that both the resulting training dynamics and the final performance on downstream tasks are empirically very similar to those of word2vec. Our main contribution is to analytically solve for both the gradient flow training dynamics and the final word embeddings in terms of only the corpus statistics and training hyperparameters. The solutions reveal that these models learn orthogonal linear subspaces one at a time, each one incrementing the effective rank of the embeddings until model capacity is saturated. Training on Wikipedia, we find that each of the top linear subspaces represents an interpretable topic-level concept. Finally, we apply our theory to describe how linear representations of more abstract semantic concepts emerge during training; these can be used to complete analogies via vector addition.
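For concreteness, a minimal sketch of the approximation the abstract refers to, under the assumption that the loss in question is the standard skip-gram-with-negative-sampling objective (the paper's exact form and notation may differ): each log-sigmoid term, written here with $u_w$ and $v_c$ denoting word and context embeddings introduced for illustration, is replaced by its fourth-order Taylor polynomial around the origin,
$$-\log \sigma(x) = \log 2 - \frac{x}{2} + \frac{x^2}{8} - \frac{x^4}{192} + O(x^6), \qquad x = u_w^{\top} v_c,$$
so the approximate loss is an explicit degree-four polynomial in the embedding vectors.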