Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token value vectors across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings and even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that, when paired with suitable prompts, a layer's attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models by fine-tuning Value Aggregation.
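The two pooling ideas described above can be illustrated with a minimal NumPy sketch on random tensors. It is an illustration under stated assumptions, not the paper's implementation: mean pooling is assumed for VA (the abstract only says "pools"), a single layer is shown for the aligned variant, and the function names `value_aggregation` and `aligned_weighted_va`, the tensor shapes, and the random inputs are all hypothetical.

```python
import numpy as np


def value_aggregation(values):
    """Pool value vectors over layers and token positions.

    values: array of shape (num_layers, num_tokens, d_value).
    Mean pooling is an assumption; the abstract only says "pools".
    """
    return values.mean(axis=(0, 1))


def aligned_weighted_va(values, last_token_attn, w_o):
    """Weight per-head values by the last token's attention, then apply W_O.

    values:          (num_heads, num_tokens, d_head)
    last_token_attn: (num_heads, num_tokens), each row sums to 1
    w_o:             (num_heads * d_head, d_model)
    """
    # Attention-weighted sum over tokens for each head: (num_heads, d_head).
    weighted = np.einsum("ht,htd->hd", last_token_attn, values)
    # Concatenate the heads and project into the residual-stream space.
    return weighted.reshape(-1) @ w_o


# Random stand-ins for tensors that would come from an LLM forward pass.
rng = np.random.default_rng(0)
num_layers, num_heads, num_tokens, d_head, d_model = 3, 2, 5, 4, 8

layer_values = rng.normal(size=(num_layers, num_tokens, num_heads * d_head))
va_embedding = value_aggregation(layer_values)

head_values = rng.normal(size=(num_heads, num_tokens, d_head))
scores = rng.normal(size=(num_heads, num_tokens))
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax
w_o = rng.normal(size=(num_heads * d_head, d_model))
aligned_embedding = aligned_weighted_va(head_values, attn, w_o)
```

In this sketch, `aligned_embedding` is exactly what a standard multi-head attention layer would emit at the last token position given these values, attention weights, and $W_O$, which is the sense in which the attention output can be read as an aligned, weighted value vector.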