Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query, but the extra generation step can add inference overhead and produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both stages. Once generated, a skill is injected directly into the judge context. Across multiple RM benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2, it yields significant gains over vanilla judging for each main backbone (+13.44% for Qwen3-8B and +18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalizability, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at https://github.com/xing-stellus-yue/Eval-Skill.
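The two-stage, explore-then-select procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how candidates are proposed or scored, so the names `accuracy`, `evolve_stage`, and `evolve_skill`, the `Judge` and `Proposer` callables, and the selection rule (keep a candidate only if it improves pairwise accuracy on the small evolution set) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# A preference case: (prompt, response_a, response_b, index of the preferred response).
Case = Tuple[str, str, str, int]
# Hypothetical judge: given (skill_text, prompt, response_a, response_b), returns 0 or 1.
Judge = Callable[[str, str, str, str], int]
# Hypothetical proposer: given the current skill text, returns candidate skill texts.
Proposer = Callable[[str], List[str]]


def accuracy(judge: Judge, skill: str, cases: Sequence[Case]) -> float:
    """Fraction of cases where the judge, with the skill in context, picks the preferred response."""
    if not cases:
        return 0.0
    hits = sum(judge(skill, p, a, b) == y for p, a, b, y in cases)
    return hits / len(cases)


def evolve_stage(judge: Judge, skill: str, propose: Proposer,
                 cases: Sequence[Case], rounds: int) -> str:
    """Exploration and selection: try candidates, keep one only if it beats the incumbent."""
    best, best_acc = skill, accuracy(judge, skill, cases)
    for _ in range(rounds):
        improved = False
        for cand in propose(best):          # exploration (assumed: proposer is given)
            acc = accuracy(judge, cand, cases)
            if acc > best_acc:              # selection (assumed criterion: accuracy gain)
                best, best_acc, improved = cand, acc, True
        if not improved:
            break                           # stop early when no candidate helps
    return best


def evolve_skill(judge: Judge, propose_workflow: Proposer, propose_principles: Proposer,
                 cases: Sequence[Case], rounds: int = 3) -> str:
    """Stage 1 evolves a workflow; stage 2 extends it with principles."""
    workflow = evolve_stage(judge, "", propose_workflow, cases, rounds)
    return evolve_stage(judge, workflow, propose_principles, cases, rounds)
```

In this sketch the returned string plays the role of the reusable skill: it would be prepended to the judge prompt for every query in the domain, so no per-query rubric generation is needed at inference time. In practice the proposers would be LLM calls and `cases` would be the roughly 100 evolution cases per domain mentioned above.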