LLM-as-a-judge is widely used as a scalable substitute for human evaluation, yet current approaches rely on black-box access and struggle to detect subtle dishonesty, such as sycophancy and manipulation. We introduce Judge Using Safety-Steered Alternatives (JUSSA), a framework that leverages a model's internal representations to optimize an honesty-promoting steering vector from a single training example, generating contrastive alternatives that give judges a reference point for detecting dishonesty. We test JUSSA on a novel manipulation benchmark with human-validated response pairs at varying dishonesty levels, finding AUROC improvements for both GPT-4.1 (0.893 $\to$ 0.946) and Claude Haiku (0.859 $\to$ 0.929) judges. Performance degrades, however, when task complexity is mismatched to judge capability, suggesting that contrastive evaluation helps most when the task is challenging but within the judge's reach. Layer-wise analysis further shows that steering is most effective in middle layers, where model representations begin to diverge between honest and dishonest prompt processing. Our work demonstrates that steering vectors can serve as tools for evaluation rather than only for improving model outputs at inference time, opening a new direction for thorough white-box auditing.
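To make the core mechanism concrete, below is a minimal sketch of applying a steering vector to a model's residual stream via a forward hook, then using the steered generation as a contrastive alternative. The model name, layer index, steering strength, and the `honesty_vector` placeholder are all illustrative assumptions, not the paper's released implementation (JUSSA optimizes this vector from a single training example):

```python
# Minimal sketch of activation steering, under assumed names/values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

LAYER = 14   # a middle layer, where the paper reports steering works best
ALPHA = 4.0  # steering strength (hypothetical value)
# Honesty-promoting direction in the residual stream; in JUSSA this is
# optimized from one training example, here it is a random stand-in.
honesty_vector = torch.randn(model.config.hidden_size, dtype=torch.float16)

def steer(module, inputs, output):
    # Decoder layers return a tuple; hidden states are the first element.
    hidden = output[0] + ALPHA * honesty_vector.to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt = "Should I tell my boss the project is behind schedule?"
ids = tokenizer(prompt, return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=128)
handle.remove()  # restore unsteered behavior

# The steered completion serves as the contrastive reference that a judge
# compares against the model's original response when scoring honesty.
print(tokenizer.decode(steered[0], skip_special_tokens=True))
```

The design choice here is to intervene at a single middle layer rather than across all layers, matching the layer-wise finding above; a judge prompt would then present the original and steered responses side by side.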