The growing scale of evaluation tasks has led to the widespread adoption of automated evaluation using LLMs, a paradigm known as "LLM-as-a-judge". However, improving its alignment with human preferences without complex prompts or fine-tuning remains challenging. Previous studies have mainly optimized over shallow, final-layer outputs, overlooking the rich cross-layer representations. In this work, motivated by preliminary findings that middle-to-upper layers encode semantically rich, task-relevant representations that are often more aligned with human judgments than the final layer, we propose LAGER, a post-hoc, plug-and-play framework for improving the alignment of LLM-as-a-Judge point-wise evaluations with human scores by leveraging internal representations. LAGER produces fine-grained judgment scores by aggregating cross-layer score-token logits and computing the expected score under a softmax-based distribution, while keeping the LLM backbone frozen and leaving the inference process unchanged. LAGER fully exploits the complementary information across different layers, overcoming the limitations of relying solely on the final layer. We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Without reasoning steps, LAGER matches or outperforms reasoning-based methods. Experiments on downstream applications, such as data selection and emotional understanding, further demonstrate the generalization of LAGER.
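The scoring mechanism described above — aggregate score-token logits across layers, normalize with a softmax, then take the expectation over the candidate scores — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `lager_score`, the uniform averaging over layers, and the choice of score tokens "1".."5" are all assumptions; the actual method may select or weight layers differently.

```python
import numpy as np

def lager_score(layer_logits: np.ndarray, score_values: np.ndarray) -> float:
    """Aggregate per-layer score-token logits into one expected score.

    layer_logits: (num_layers, num_score_tokens) — logits of the candidate
                  score tokens (e.g. "1".."5") taken from each selected
                  layer's representation (hypothetical input shape).
    score_values: (num_score_tokens,) — numeric value of each score token.
    """
    # Aggregate across layers; uniform averaging is an assumption here.
    agg = layer_logits.mean(axis=0)
    # Numerically stable softmax over the score tokens.
    agg = agg - agg.max()
    probs = np.exp(agg) / np.exp(agg).sum()
    # Fine-grained judgment score: expectation under the distribution.
    return float(np.dot(probs, score_values))

# Illustrative usage: three layers that all strongly favor the score "5"
# on a 1-5 scale yield an expected score close to 5.
logits = np.array([[0.0, 0.0, 0.0, 0.0, 10.0]] * 3)
print(lager_score(logits, np.arange(1, 6)))
```

Because the output is an expectation rather than an argmax over score tokens, it varies continuously with the logits, which is what gives the fine-grained scores the abstract refers to.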