Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubric-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs). OpenRS uses an explicit meta-rubric, a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced, and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates the criterion-level preferences externally, avoiding pointwise weighted scalarization while improving discriminability in open-ended settings. To keep principles consistent yet editable across domains, we introduce a two-level meta-rubric refinement pipeline: automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles. PVRs complement this pipeline by acting both as hard-constraint guardrails against degenerate behaviors and as a source of verifiable reward for objective sub-tasks where ground-truth or programmatic checks are available. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.
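To make the flow from criterion-wise comparison to a pairwise reward concrete, the following is a minimal Python sketch under stated assumptions. The abstract does not specify the aggregation rule, so the normalized weighted vote used here is an assumption, as are all names (CriterionVerdict, aggregate_pairwise, passes_guardrails, pairwise_reward) and the choice to let guardrail checks override the rubric comparison. In a real system the verdicts would come from an LLM judge; here they are passed in directly.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CriterionVerdict:
    """One criterion-level comparison between response A and response B (hypothetical type)."""
    criterion: str    # adaptive rubric item instantiated for this response pair
    weight: float     # importance assigned under the meta-rubric (assumed numeric)
    preference: int   # +1 if A is preferred, -1 if B is preferred, 0 for a tie


def aggregate_pairwise(verdicts: List[CriterionVerdict]) -> float:
    """Aggregate criterion-level preferences outside the judge.

    Assumption: a normalized weighted vote in [-1, 1]. Positive favors A.
    """
    total = sum(v.weight for v in verdicts)
    if total == 0:
        return 0.0
    return sum(v.weight * v.preference for v in verdicts) / total


def passes_guardrails(response: str, checks: List[Callable[[str], bool]]) -> bool:
    """Pointwise verifiable checks: every programmatic check must pass."""
    return all(check(response) for check in checks)


def pairwise_reward(
    resp_a: str,
    resp_b: str,
    verdicts: List[CriterionVerdict],
    checks: List[Callable[[str], bool]],
) -> float:
    """Combine guardrails with the aggregated preference (assumed combination rule)."""
    ok_a = passes_guardrails(resp_a, checks)
    ok_b = passes_guardrails(resp_b, checks)
    if ok_a and not ok_b:
        return 1.0    # only A satisfies the hard constraints
    if ok_b and not ok_a:
        return -1.0   # only B satisfies the hard constraints
    if not ok_a and not ok_b:
        return 0.0    # neither response is acceptable, so no preference signal
    return aggregate_pairwise(verdicts)


# Illustrative inputs: the verdicts are hard-coded rather than produced by a judge.
example_verdicts = [
    CriterionVerdict("factual accuracy", weight=2.0, preference=+1),
    CriterionVerdict("conciseness", weight=1.0, preference=-1),
]
example_checks = [lambda r: len(r.strip()) > 0]  # hypothetical non-empty check
example_reward = pairwise_reward("answer A", "answer B", example_verdicts, example_checks)
```

Keeping the aggregation in ordinary code rather than inside the judge prompt means the weighting can be inspected and edited without retraining or re-prompting, which is the property the abstract emphasizes.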
