Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the cost of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming \citep{snell2024scaling, chen2024more}, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of seven reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically examines how verifier utility depends on key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), including when verifiers are applied in roles other than the one they were trained for.
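For concreteness, outcome-level Best-of-$N$ verification admits a one-line summary: sample $N$ candidate responses from the generator and return the one the verifier scores highest. A minimal formulation (the notation is ours, not the paper's; $\pi_\theta$ denotes the generator and $r_\phi$ the reward model's score):
\[
\hat{y} = \arg\max_{i \in \{1, \dots, N\}} r_\phi(x, y_i), \qquad y_1, \dots, y_N \sim \pi_\theta(\cdot \mid x).
\]
Process-level tree search differs in that $r_\phi$ scores partial reasoning prefixes rather than complete answers, so low-scoring branches can be pruned before generation finishes.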