Text embeddings enable a wide range of applications, but their performance deteriorates on longer texts. In this paper, we find that this degradation is due to a phenomenon called Length Collapse, where embeddings of longer texts collapse into a narrow space. This collapse causes a distributional inconsistency between embeddings of different text lengths, ultimately hurting performance on downstream tasks. Theoretically, noting that the self-attention mechanism inherently functions as a low-pass filter, we prove that longer sequences increase the attenuation rate of this low-pass filtering effect. As layers go deeper, excessive low-pass filtering causes the token signals to retain only their Direct-Current (DC) component, meaning the input token feature maps collapse into a narrow space, especially for long texts. Based on this analysis, we propose to mitigate the undesirable length collapse by introducing a temperature into softmax(), which achieves a higher attenuation rate for the low-pass filter. The tuning-free method, called TempScale, can be plugged into multiple transformer-based embedding models. Empirically, we demonstrate that TempScale improves existing embedding models, especially on long text inputs, bringing up to 0.53% performance gains on 40 datasets from the Massive Text Embedding Benchmark (MTEB) and up to 0.82% gains on 4 datasets from LongEmbed, which specifically targets long-context retrieval.
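To make the core idea concrete, the following is a minimal sketch of temperature-scaled self-attention in NumPy. It is not the authors' released implementation: the function names, the convention of dividing the attention logits by the temperature, and the example temperature value are all illustrative assumptions, not the paper's tuned setting.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temp_scaled_attention(Q, K, V, temperature=1.0):
    """Single-head self-attention with an extra temperature in softmax().

    temperature=1.0 recovers standard scaled dot-product attention;
    other values reshape the attention distribution and thus change
    how strongly attention acts as a low-pass filter over token signals.
    The dividing convention and the value used below are assumptions
    for illustration only.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)          # standard dot-product scaling
    attn = softmax(logits / temperature)   # extra temperature inside softmax()
    return attn @ V

# Toy usage: a longer input simply means more rows in Q/K/V.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                # 6 tokens, embedding dim 8
out = temp_scaled_attention(X, X, X, temperature=0.5)
```

Because the change is a single scalar inside softmax(), it can be patched into an existing transformer's attention layers without retraining, which is what makes the method tuning-free to deploy.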