Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token value vectors across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings, and even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that, when paired with suitable prompts, a layer's attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning with Value Aggregation.
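A minimal sketch of the Value Aggregation idea, under illustrative assumptions: a LLaMA-style HuggingFace model whose per-layer value projections are exposed as `model.model.layers[i].self_attn.v_proj`, and mean pooling as the aggregation function. The module paths, model name, and pooling choice are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch: Value Aggregation (VA) — pool attention value vectors
# across all layers and all token positions (assumed mean pooling).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any decoder-only LLM; illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

captured_values = []  # one tensor per layer: (batch, seq_len, value_dim)

def save_values(module, inputs, output):
    # The v_proj output holds the attention value vectors for every token.
    # (For GQA models, value_dim = num_kv_heads * head_dim, not hidden size.)
    captured_values.append(output.detach())

hooks = [
    layer.self_attn.v_proj.register_forward_hook(save_values)
    for layer in model.model.layers
]

sentence = "Sentence representations are foundational to NLP."
batch = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    model(**batch)

for h in hooks:
    h.remove()

# VA: pool the value vectors over both the layer and token dimensions.
values = torch.stack(captured_values)            # (layers, batch, seq, value_dim)
embedding = values.mean(dim=(0, 2)).squeeze(0)   # (value_dim,)
print(embedding.shape)
```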
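The AlignedWVA interpretation can be sketched the same way. Because each layer's attention output at the last token already equals $W_O$ applied to the attention-weighted sum of value vectors, capturing the `o_proj` output at that position instantiates the "aligned weighted value vectors" described above. The prompt template, module paths, and mean pooling over layers are again illustrative assumptions.

```python
# Sketch: Aligned Weighted VA (AlignedWVA) — pool each layer's attention
# output at the last token, where the last token's attention scores act
# as the weights and W_O provides the alignment to the residual stream.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

attn_outputs = []  # one tensor per layer: (batch, seq_len, hidden)

def save_attn_output(module, inputs, output):
    # o_proj output = W_O @ (attention-weighted sum of value vectors),
    # i.e. the aligned weighted values for every query position.
    attn_outputs.append(output.detach())

hooks = [
    layer.self_attn.o_proj.register_forward_hook(save_attn_output)
    for layer in model.model.layers
]

# A hypothetical "suitable prompt" that steers the last token toward
# summarizing sentence-level semantics (template is an assumption).
prompt = 'This sentence: "A cat sits on the mat." means in one word:'
batch = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    model(**batch)

for h in hooks:
    h.remove()

# AlignedWVA: keep only the last token's aligned weighted values per
# layer, then pool across layers (assumed mean pooling).
last_token = torch.stack([o[:, -1, :] for o in attn_outputs])  # (layers, batch, hidden)
embedding = last_token.mean(dim=0).squeeze(0)                  # (hidden,)
print(embedding.shape)
```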