Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the cost of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of seven reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), including when verifiers are applied outside their native role.
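To make the outcome-level setting concrete, the sketch below shows the generic Best-of-$N$ selection rule: sample $N$ complete candidate answers, score each with a verifier, and return the highest-scoring one. This is a minimal illustration under stated assumptions, not the paper's implementation; `reward_fn` and `toy_reward` are hypothetical stand-ins for an ORM/PRM scorer.

```python
from typing import Callable, List


def best_of_n(
    question: str,
    candidates: List[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Outcome-level Best-of-N: score each complete candidate answer
    with a verifier and return the highest-scoring one."""
    scores = [reward_fn(question, c) for c in candidates]
    best_idx = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_idx]


# Hypothetical stand-in for a real reward model: a true ORM/PRM would
# return a learned scalar score for the (question, answer) pair.
def toy_reward(question: str, answer: str) -> float:
    return float(len(answer))  # placeholder heuristic, not a real verifier


if __name__ == "__main__":
    cands = ["Answer: A", "Answer: B, because the holding controls", "Answer: C"]
    print(best_of_n("Which holding applies?", cands, toy_reward))
```

Process-level (tree search) verification differs in that the verifier scores partial reasoning steps during generation rather than only complete answers.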