Inference-time scaling techniques have shown promise in enhancing the reasoning capabilities of large language models (LLMs). While recent research has primarily focused on training-time optimization, our work highlights inference-time reward model (RM)-based reasoning as a critical yet overlooked avenue. In this paper, we conduct a systematic analysis of RM behavior across downstream reasoning tasks, revealing three key limitations: (1) RMs can impair performance on simple questions, (2) their discriminative ability declines as the number of samples grows, and (3) high search diversity undermines RM performance. To address these issues, we propose CRISP (Clustered Reward Integration with Stepwise Prefixing), a novel inference-time algorithm that clusters generated reasoning paths by their final answers, aggregates reward signals at the cluster level, and adaptively updates prefix prompts to guide generation. Experimental results demonstrate that CRISP significantly enhances LLM reasoning performance, achieving up to a 5% accuracy improvement over other RM-based inference methods and an average gain of 10% over advanced reasoning models.
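To make the cluster-level reward aggregation concrete, the following minimal Python sketch illustrates the answer-clustering and selection step. The function name `crisp_cluster_select`, its interface, and the mean-reward aggregation rule are illustrative assumptions rather than the paper's specification; CRISP's adaptive prefix-update step is omitted.

```python
from collections import defaultdict

def crisp_cluster_select(samples, rewards):
    """Select a final answer by clustering sampled reasoning paths on
    their final answers and aggregating RM scores per cluster.

    samples: list of (reasoning_path, final_answer) pairs
    rewards: list of per-path scalar reward-model scores

    NOTE: mean-reward aggregation is one plausible choice, assumed
    here for illustration; the adaptive prefix updates described in
    the abstract are not modeled.
    """
    clusters = defaultdict(list)
    for (_, answer), r in zip(samples, rewards):
        clusters[answer].append(r)
    # Score each answer cluster by its mean reward and return the best.
    return max(clusters, key=lambda a: sum(clusters[a]) / len(clusters[a]))

# Toy usage: three sampled paths yielding two distinct final answers.
samples = [("path 1", "42"), ("path 2", "42"), ("path 3", "41")]
rewards = [0.8, 0.7, 0.6]
print(crisp_cluster_select(samples, rewards))  # -> "42" (mean 0.75 vs 0.6)
```

Aggregating at the cluster level, rather than picking the single highest-reward path, is what shields the selection from individual RM scoring errors on simple questions.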