Reasoning-capable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while yielding limited or even negative effects on simpler evaluations and incurring significantly higher computational cost. These findings indicate that reasoning should be applied selectively rather than universally, while accounting for possible distribution shift. We propose Robust Adaptive Cost-Efficient Routing (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal--dual algorithm, and enjoys theoretical guarantees, including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy--cost trade-offs under distribution shift.
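To make the two ingredients named above concrete, the following is a minimal illustrative sketch, not the RACER algorithm itself. It assumes hypothetical per-query estimates of the accuracy gain from the reasoning judge (`gains`) and of its extra cost (`costs`). The function names, the grid over the dual variable, and the bisection on the budget multiplier are all assumptions made for illustration. The first function uses the standard dual form of a KL-constrained worst-case expectation, sup over Q with KL(Q || P) <= rho of E_Q[loss] = inf over eta > 0 of eta * rho + eta * log E_P[exp(loss / eta)]. The second routes queries under a budget with a single Lagrange multiplier.

```python
import math


def kl_worst_case_mean(losses, rho, etas=None):
    """Upper-bound the worst-case mean loss over a KL ball of radius rho.

    Evaluates the dual objective eta*rho + eta*log(mean(exp(loss/eta)))
    on a grid of eta values (an assumption of this sketch; any eta > 0
    gives a valid upper bound) and returns the smallest value found.
    """
    if etas is None:
        # Hypothetical geometric grid from 1e-3 to 1e3.
        etas = [10.0 ** (k / 4.0) for k in range(-12, 13)]
    m = max(losses)  # shift by the max for numerical stability
    n = len(losses)
    best = math.inf
    for eta in etas:
        log_mean = math.log(sum(math.exp((x - m) / eta) for x in losses) / n)
        best = min(best, eta * rho + m + eta * log_mean)
    return best


def route_under_budget(gains, costs, budget, iters=60):
    """Return one bool per query: True means "use the reasoning judge".

    A query is routed to the reasoning judge when gain - lam * cost > 0.
    The multiplier lam is found by bisection so that total extra cost
    stays within the budget. Assumes costs > 0 and budget >= 0.
    """
    def decide(lam):
        return [g - lam * c > 0 for g, c in zip(gains, costs)]

    def spend(lam):
        return sum(c for use, c in zip(decide(lam), costs) if use)

    if spend(0.0) <= budget:
        return decide(0.0)  # budget is not binding
    lo, hi = 0.0, 1.0
    while spend(hi) > budget:  # grow hi until the budget is met
        hi *= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if spend(mid) > budget:
            lo = mid
        else:
            hi = mid
    return decide(hi)  # hi always satisfies the budget
```

One way to combine the two, again only as an assumption of this sketch, is to compute each entry of `gains` from pessimistic error rates returned by `kl_worst_case_mean`, so that the routing decision is based on worst-case rather than average behaviour.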