LLM-as-a-judge is widely used as a scalable substitute for human evaluation, yet current approaches rely on black-box access and struggle to detect subtle dishonesty, such as sycophancy and manipulation. We introduce Judge Using Safety-Steered Alternatives (JUSSA), a framework that leverages a model's internal representations to optimize an honesty-promoting steering vector from a single training example, generating contrastive alternatives that give judges a reference point for detecting dishonesty. We test JUSSA on a novel manipulation benchmark with human-validated response pairs at varying dishonesty levels, finding AUROC improvements for both GPT-4.1 (0.893 $\to$ 0.946) and Claude Haiku (0.859 $\to$ 0.929) judges. Performance degrades, however, when task complexity is mismatched to judge capability, suggesting that contrastive evaluation helps most when the task is challenging but within the judge's reach. Layer-wise analysis further shows that steering is most effective in middle layers, where model representations begin to diverge between honest and dishonest prompt processing. Our work demonstrates that steering vectors can serve as tools for evaluation rather than only for improving model outputs at inference time, opening a new direction for thorough white-box auditing.
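To make the core mechanism concrete, below is a minimal sketch of applying a steering vector to a model's residual stream via a forward hook, then using the steered generation as a contrastive alternative. The model name, layer index, steering strength, and the `honesty_vector` placeholder are all illustrative assumptions, not the paper's released implementation (JUSSA optimizes this vector from a single training example):

```python
# Minimal sketch of activation steering, under assumed names/values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

LAYER = 14   # a middle layer, where the paper reports steering works best
ALPHA = 4.0  # steering strength (hypothetical value)
# Honesty-promoting direction in the residual stream; in JUSSA this is
# optimized from one training example, here it is a random stand-in.
honesty_vector = torch.randn(model.config.hidden_size, dtype=torch.float16)

def steer(module, inputs, output):
    # Decoder layers return a tuple; hidden states are the first element.
    hidden = output[0] + ALPHA * honesty_vector.to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt = "Should I tell my boss the project is behind schedule?"
ids = tokenizer(prompt, return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=128)
handle.remove()  # restore unsteered behavior

# The steered completion serves as the contrastive reference that a judge
# compares against the model's original response when scoring honesty.
print(tokenizer.decode(steered[0], skip_special_tokens=True))
```

The design choice here is to intervene at a single middle layer rather than across all layers, matching the layer-wise finding above; a judge prompt would then present the original and steered responses side by side.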