Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we find a strong trade-off between the norm and the variance: the closer the mean embedding is to the origin, the larger the variance. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when sets of token embeddings are treated as clusters, we show theoretically that the variance of the entire embedding set decomposes into the within-cluster variance and the between-cluster variance. We find experimentally that as the layers of a Transformer model deepen, the embeddings move farther from the origin, the between-cluster variance decreases relative to the total, and the within-cluster variance increases relative to the total. These results are consistent with existing studies on the anisotropy of embedding spaces across layers.
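For concreteness, the two statistical identities referenced above can be sketched as follows; the notation (x_i, mu, N_c) is ours and the paper's exact formulation may differ. Given embeddings x_1, ..., x_N in R^d with mean mu, the "well-known formula" is the usual decomposition of variance into a second moment minus a squared mean, and partitioning the embeddings into clusters c (e.g., one cluster per token type) with cluster means mu_c and sizes N_c gives the within/between split:

\[
\frac{1}{N}\sum_{i=1}^{N}\lVert x_i - \mu \rVert^2
= \frac{1}{N}\sum_{i=1}^{N}\lVert x_i \rVert^2 - \lVert \mu \rVert^2,
\qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\]

\[
\frac{1}{N}\sum_{c}\sum_{i \in c}\lVert x_i - \mu \rVert^2
= \underbrace{\sum_{c}\frac{N_c}{N}\cdot\frac{1}{N_c}\sum_{i \in c}\lVert x_i - \mu_c \rVert^2}_{\text{within-cluster}}
+ \underbrace{\sum_{c}\frac{N_c}{N}\,\lVert \mu_c - \mu \rVert^2}_{\text{between-cluster}}.
\]

The first identity is what makes a sequential computation possible: the mean norm and the variance are recoverable from running sums alone, without storing the embeddings themselves.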
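Below is a minimal Python sketch of one way such a sequential computation and the cluster decomposition could look; the function names are hypothetical and this is our illustration under the identities above, not the paper's implementation (which may, for instance, use a numerically stabler Welford-style update).

```python
import numpy as np

def sequential_norm_and_variance(batches):
    """Online computation of the mean-embedding norm and the total variance.

    Keeps only a running sum of embeddings and a running sum of squared
    norms, then applies Var = E[||x||^2] - ||E[x]||^2, so the embeddings
    never need to be held in memory at once.
    """
    n = 0
    sum_x = None   # running sum of embeddings, shape (d,)
    sum_sq = 0.0   # running sum of squared norms
    for batch in batches:  # each batch: (b, d) array of embeddings
        batch = np.asarray(batch, dtype=np.float64)
        if sum_x is None:
            sum_x = np.zeros(batch.shape[1])
        n += batch.shape[0]
        sum_x += batch.sum(axis=0)
        sum_sq += float((batch ** 2).sum())
    mean_norm = float(np.linalg.norm(sum_x / n))
    variance = sum_sq / n - mean_norm ** 2
    return mean_norm, variance

def variance_decomposition(embeddings, labels):
    """Numerically verify total variance = within-cluster + between-cluster."""
    x = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    mu = x.mean(axis=0)
    total = ((x - mu) ** 2).sum(axis=1).mean()
    within = between = 0.0
    for c in np.unique(labels):
        xc = x[labels == c]
        mu_c = xc.mean(axis=0)
        w = len(xc) / len(x)                               # cluster weight N_c / N
        within += w * ((xc - mu_c) ** 2).sum(axis=1).mean()  # weighted within-cluster variance
        between += w * ((mu_c - mu) ** 2).sum()              # weighted between-cluster variance
    return total, within, between

# Toy usage with synthetic "embeddings" and random token labels.
rng = np.random.default_rng(0)
batches = [rng.normal(loc=1.0, size=(500, 8)) for _ in range(2)]
print(sequential_norm_and_variance(batches))

x = np.concatenate(batches)
labels = rng.integers(0, 3, size=len(x))
total, within, between = variance_decomposition(x, labels)
print(total, within + between)  # equal up to floating-point error
```

Accumulating in float64 mitigates the catastrophic cancellation that the second-moment-minus-squared-mean formula can suffer when the mean norm is large relative to the variance.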