Token uniformity is commonly observed in transformer-based models: different tokens come to share a large proportion of similar information after passing through multiple stacked self-attention layers. In this paper, we propose to use the distribution of singular values of the outputs of each transformer layer to characterise the phenomenon of token uniformity, and we empirically illustrate that a less skewed singular value distribution can alleviate the "token uniformity" problem. Based on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that, apart from alleviating token uniformity, the transformation function should preserve the local neighbourhood structure of the original embedding space. We apply the proposed singular value transformation function to a range of transformer-based language models, including BERT, ALBERT, RoBERTa and DistilBERT, and observe improved performance on semantic textual similarity evaluation and a range of GLUE tasks. Our source code is available at https://github.com/hanqi-qi/tokenUni.git.
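The link between a skewed singular value distribution and token uniformity can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the synthetic `hidden` matrix stands in for one layer's token outputs, the helper names are hypothetical, and `flatten_spectrum` uses a simple power-law rescaling as a stand-in. It is not the transformation function proposed in the paper, only a monotone rescaling that makes the spectrum less skewed while keeping the order of the singular values.

```python
import numpy as np


def singular_values(hidden):
    """Singular values (descending) of a (num_tokens, dim) matrix."""
    return np.linalg.svd(hidden, compute_uv=False)


def top_share(sv):
    """Fraction of the total singular-value mass held by the largest value."""
    return float(sv[0] / sv.sum())


def mean_pairwise_cosine(hidden):
    """Average cosine similarity over all distinct token pairs."""
    unit = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = hidden.shape[0]
    # Drop the diagonal (self-similarity is always 1).
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))


def flatten_spectrum(hidden, alpha=0.5):
    """Stand-in transform (an assumption, not the paper's function).

    Rescales each singular value s to s_max * (s / s_max) ** alpha.
    For 0 < alpha < 1 this keeps the largest value fixed, raises the
    smaller ones, and preserves their order.
    """
    u, s, vt = np.linalg.svd(hidden, full_matrices=False)
    new_s = s[0] * (s / s[0]) ** alpha
    return (u * new_s) @ vt


# Synthetic "layer output": 32 tokens in 64 dimensions that all share
# one strong common component, mimicking token uniformity.
rng = np.random.default_rng(0)
common = rng.normal(size=(1, 64))
hidden = 5.0 * common + rng.normal(size=(32, 64))

flattened = flatten_spectrum(hidden, alpha=0.5)

print("top singular value share:",
      top_share(singular_values(hidden)), "->",
      top_share(singular_values(flattened)))
print("mean pairwise cosine:",
      mean_pairwise_cosine(hidden), "->",
      mean_pairwise_cosine(flattened))
```

In practice the `hidden` matrix would be taken from a real model's per-layer token representations rather than generated synthetically. Whether a given rescaling also preserves the local neighbourhood structure of the embedding space is a separate property that this sketch does not check.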