Despite their remarkable capabilities, LLMs learn word representations that exhibit the undesirable yet poorly understood feature of anisotropy. In this paper, we argue that the second moment in Adam is a cause of anisotropic embeddings, and suggest a modified optimizer called Coupled Adam to mitigate the problem. Our experiments demonstrate that Coupled Adam significantly improves the quality of embeddings, while also leading to better upstream and downstream performance on large enough datasets.
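The abstract does not specify Coupled Adam's update rule. As a rough, non-authoritative sketch of the kind of mechanism it alludes to, the code below contrasts standard Adam, where every parameter keeps its own second-moment estimate, with a hypothetical "coupled" variant that shares one second-moment estimate across all rows of the embedding matrix. The function names, the averaging over the vocabulary axis, and all hyperparameters are illustrative assumptions, not the paper's definition.

    import numpy as np

    def adam_step(p, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
        """One standard Adam step: each parameter keeps its own second moment v."""
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2              # per-parameter second moment
        m_hat = m / (1 - b1**t)                   # bias correction
        v_hat = v / (1 - b2**t)
        return p - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

    def coupled_adam_step(E, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
        """Hypothetical coupled step for an embedding matrix E (vocab x dim).
        ASSUMPTION: 'coupling' here means averaging the squared gradient over
        the vocabulary axis, so every embedding vector sees the same adaptive
        scaling; this is an illustration, not the paper's stated algorithm."""
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * np.mean(g**2, axis=0, keepdims=True)  # shared across rows
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)                   # shape (1, dim), broadcasts over rows
        return E - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

    # Toy usage: vocab=100, dim=16, one step on a stand-in gradient.
    rng = np.random.default_rng(0)
    E = rng.normal(size=(100, 16))
    m, v = np.zeros_like(E), np.zeros((1, 16))
    g = rng.normal(size=E.shape)
    E, m, v = coupled_adam_step(E, g, m, v, t=1)

The intuition behind such a coupling is that per-token second moments scale each embedding's updates by its own (frequency-dependent) gradient statistics, which can push the embedding distribution away from isotropy; a shared estimate removes that per-token asymmetry.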