The representation degeneration problem is widely observed in Transformer-based self-supervised learning methods. In NLP, it takes the form of anisotropy, a singular property of hidden representations that makes them unexpectedly close to each other in terms of angular distance (cosine similarity). Recent works suggest that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. In this paper, we show that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We further show that the anisotropy problem extends to Transformers trained on other modalities. Our observations suggest that anisotropy is actually inherent to Transformer-based models.
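The notion of anisotropy discussed above can be made concrete with a small sketch. The following is not the paper's code, but a minimal illustration of how anisotropy is commonly quantified: the average pairwise cosine similarity of a set of representation vectors. A value near 0 indicates an isotropic (uniformly spread) space, while a value well above 0 means the vectors cluster in a narrow cone.

```python
import numpy as np

def average_cosine_similarity(hidden_states: np.ndarray) -> float:
    """hidden_states: (n, d) matrix of n representation vectors."""
    # Normalize each vector to unit length.
    norms = np.linalg.norm(hidden_states, axis=1, keepdims=True)
    unit = hidden_states / norms
    # Pairwise cosine similarities are the entries of unit @ unit.T;
    # we average the off-diagonal ones (i.e., distinct pairs).
    sims = unit @ unit.T
    n = sims.shape[0]
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(off_diag.mean())

rng = np.random.default_rng(0)
# Isotropic baseline: random Gaussian directions are nearly orthogonal
# on average, so the mean cosine similarity is close to 0.
isotropic = rng.normal(size=(500, 64))
# Anisotropic case: a shared offset pushes all vectors into a narrow
# cone, driving the mean cosine similarity toward 1.
anisotropic = isotropic + 5.0
print(average_cosine_similarity(isotropic))    # close to 0
print(average_cosine_similarity(anisotropic))  # close to 1
```

On contextual embeddings from a trained Transformer, this statistic is typically computed over hidden states of tokens drawn from a corpus; the degeneration problem manifests as a mean similarity far above the near-zero value expected from an isotropic distribution.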