Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility. In this work, we propose FLIP (FLipped Inference for Prompt reconstruction), a reference-free and rubric-free reward modeling approach that reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response. The similarity between the inferred and the original instructions then serves as the reward signal. Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%. Moreover, FLIP substantially improves downstream performance in extrinsic evaluations under test-time scaling via parallel sampling and under GRPO training. We further find that FLIP is particularly effective for longer outputs and robust to common forms of reward hacking. By explicitly exploiting the validation-generation gap, FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail. Code is available at https://github.com/yikee/FLIP.
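The core scoring loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`infer_prompt`, `flip_reward`) are hypothetical, the backward-inference step is stubbed out where FLIP would query a small language model, and a simple string-overlap ratio stands in for whatever similarity measure the paper uses.

```python
# Minimal sketch of the FLIP reward idea: infer the instruction that would
# most plausibly produce a response, then score the response by how similar
# the inferred instruction is to the original one.
# NOTE: all names here are illustrative assumptions, not the paper's API.
from difflib import SequenceMatcher


def infer_prompt(response: str) -> str:
    """Stub for backward inference.

    In FLIP this step would prompt a small language model to reconstruct
    the instruction behind `response`; here we just take the response's
    first sentence as a trivial placeholder guess.
    """
    return response.split(".")[0].strip()


def flip_reward(original_prompt: str, response: str) -> float:
    """Reward = similarity between inferred and original instructions.

    A string-overlap ratio is used purely for illustration; an embedding
    cosine similarity or an LM-based comparison would be more realistic.
    """
    inferred = infer_prompt(response)
    return SequenceMatcher(None, original_prompt.lower(), inferred.lower()).ratio()
```

Because the reward is reference-free, it needs only the original instruction and the candidate response, which is what lets it rank parallel samples at test time or serve as the reward signal in GRPO training.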