Self-supervised word embedding algorithms such as word2vec provide a minimal setting for studying representation learning in language modeling. We examine the quartic Taylor approximation of the word2vec loss around the origin, and we show that both the resulting training dynamics and the final performance on downstream tasks are empirically very similar to those of word2vec. Our main contribution is to analytically solve for both the gradient flow training dynamics and the final word embeddings in terms of only the corpus statistics and training hyperparameters. The solutions reveal that these models learn orthogonal linear subspaces one at a time, each one incrementing the effective rank of the embeddings until model capacity is saturated. Training on Wikipedia, we find that each of the top linear subspaces represents an interpretable topic-level concept. Finally, we apply our theory to describe how linear representations of more abstract semantic concepts emerge during training; these can be used to complete analogies via vector addition.
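For concreteness, here is a minimal sketch of the quantity being expanded, written in the standard corpus-count form of the skip-gram negative-sampling (SGNS) objective; the notation $\#(w,c)$, $\#(w)$, $\#(c)$, $|D|$, $k$, $u_w$, $v_c$ is an assumed convention (following Levy and Goldberg), not notation taken from this paper:

\[
\mathcal{L}(U,V) \;=\; -\sum_{w,c}\Big[\,\#(w,c)\,\log\sigma\!\big(u_w^{\top}v_c\big) \;+\; \tfrac{k\,\#(w)\,\#(c)}{|D|}\,\log\sigma\!\big(-u_w^{\top}v_c\big)\Big],
\qquad
\log\sigma(x) \;=\; -\log 2 \;+\; \tfrac{x}{2} \;-\; \tfrac{x^{2}}{8} \;+\; O(x^{4}).
\]

Since each inner product $x = u_w^{\top}v_c$ is itself quadratic in the embedding parameters, truncating $\log\sigma$ at second order in $x$ leaves a loss that is a degree-four polynomial in the parameters, which is one way to read the "quartic Taylor approximation around the origin" referred to above.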