Using a language model to score or rank candidate responses has become a scalable alternative to human evaluation in reinforcement learning from human feedback (RLHF) pipelines, benchmarking, and application-layer evaluations. However, the reliability of the judge's output depends heavily on the prompting and aggregation strategy. We present an empirical investigation of four drop-in techniques -- ensemble scoring, task-specific criteria injection, calibration context, and adaptive model escalation -- for improving LLM judge accuracy on RewardBench 2, under a unifying lens of noise control for the stochastic judge: ensembling as Monte Carlo averaging over per-call noise, criteria injection as sharpening of between-response discrimination, and per-response score variance as an uncertainty signal. Ensemble scoring and task-specific criteria injection (the latter virtually cost-free) together reach up to 85.8% accuracy, +13.5pp over baseline. Calibration context and adaptive model escalation also improve over baseline but are dominated by criteria + ensembling on the cost-accuracy Pareto frontier. Small models benefit disproportionately from ensembling, making high-accuracy LLM judges accessible at low cost. We show that these techniques generalise across model providers, evaluating on both the OpenAI GPT and Anthropic Claude families.
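To make the noise-control view concrete, the following is a minimal sketch of ensemble scoring as Monte Carlo averaging, not the authors' implementation. Everything in it is an assumption made for illustration: the `Judge` callable interface, the function names, the Gaussian noise model standing in for a real LLM call, and the toy quality values for four candidates. It returns the per-response sample variance alongside the mean, which is one way the uncertainty signal mentioned above could be exposed.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed interface: a judge maps (prompt, response) to a numeric score,
# and repeated calls on the same input may return different scores.
Judge = Callable[[str, str], float]


def ensemble_score(judge: Judge, prompt: str, response: str, k: int = 5) -> Tuple[float, float]:
    """Call the judge k times and return (mean score, sample variance)."""
    scores = [judge(prompt, response) for _ in range(k)]
    mean = statistics.fmean(scores)
    # Sample variance needs at least two draws; report 0.0 for a single call.
    variance = statistics.variance(scores) if k > 1 else 0.0
    return mean, variance


def pick_best(
    judge: Judge, prompt: str, candidates: Sequence[str], k: int = 5
) -> Tuple[int, List[Tuple[float, float]]]:
    """Return the index of the candidate with the highest mean score, plus all stats."""
    stats = [ensemble_score(judge, prompt, c, k) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: stats[i][0])
    return best, stats


def make_noisy_judge(quality: Dict[str, float], noise_sd: float, rng: random.Random) -> Judge:
    """Hypothetical stand-in for an LLM call: true quality plus Gaussian noise."""
    def judge(prompt: str, response: str) -> float:
        return quality[response] + rng.gauss(0.0, noise_sd)
    return judge


def simulate_accuracy(k: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the truly best candidate is selected."""
    rng = random.Random(seed)
    # Toy setup (assumed): one better response and three equally weaker ones.
    quality = {"best": 1.0, "alt_a": 0.5, "alt_b": 0.5, "alt_c": 0.5}
    candidates = list(quality)
    judge = make_noisy_judge(quality, noise_sd=1.0, rng=rng)
    hits = 0
    for _ in range(trials):
        best, _stats = pick_best(judge, "prompt", candidates, k)
        hits += candidates[best] == "best"
    return hits / trials
```

Comparing `simulate_accuracy(k=1)` with `simulate_accuracy(k=9)` in this toy setting shows the ensemble selecting the top candidate more often, because averaging k independent draws shrinks the noise standard deviation by a factor of sqrt(k). Real judge calls need not be independent or Gaussian, so the simulation illustrates the mechanism only and says nothing about the accuracy figures reported above.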