Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current RM evaluation methods focus solely on preference-perception accuracy in given scenarios, obscuring critical vulnerabilities of RMs in real-world settings. We argue that the true challenge lies in assessing a novel dimension: suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework designed for RM suitability inference. Rather than answering "How accurate is the RM's preference perception on given samples?", it employs scientific auditing to answer: "Can we infer that an RM exhibits systematic vulnerabilities in a specific real-world scenario?". Under real-world perturbations, Reward Auditor quantifies statistical significance and effect size by auditing the distributional degradation of RM preference-perception confidence. This enables inference of both the certainty and the severity of RM vulnerabilities across diverse real-world scenarios, laying a solid foundation for next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.
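To make the auditing idea concrete, the following is a minimal sketch (not the paper's implementation) of hypothesis testing over confidence degradation: it compares an RM's preference-confidence margins on clean versus perturbed inputs with a one-sided Mann-Whitney U test for statistical significance and Cohen's d for effect size. The variable names and the simulated confidence data are assumptions for illustration.

```python
# Sketch: audit whether an RM's preference-perception confidence degrades
# under a perturbed scenario. Data here is simulated, not from a real RM.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Confidence margin per pair: reward(chosen) - reward(rejected)
clean_conf = rng.normal(loc=1.0, scale=0.5, size=200)      # original prompts
perturbed_conf = rng.normal(loc=0.4, scale=0.5, size=200)  # e.g. paraphrased prompts

# One-sided test: is confidence systematically lower under perturbation?
stat, p_value = mannwhitneyu(perturbed_conf, clean_conf, alternative="less")

# Effect size (Cohen's d) quantifies the severity of the degradation
pooled_sd = np.sqrt((clean_conf.var(ddof=1) + perturbed_conf.var(ddof=1)) / 2)
cohens_d = (clean_conf.mean() - perturbed_conf.mean()) / pooled_sd

print(f"p-value: {p_value:.2e}, Cohen's d: {cohens_d:.2f}")
```

A small p-value indicates the degradation is unlikely under the null hypothesis of no scenario effect (certainty), while Cohen's d measures how large the confidence drop is (severity), mirroring the two quantities the framework reports.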