TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks

Multi-step reasoning tasks like mathematical problem solving are vulnerable to cascading failures, where a single incorrect step leads to complete solution breakdown. Current LLM routing methods assign entire queries to one model, treating all reasoning steps as equal. We propose TRIM (Targeted routing in multi-step reasoning tasks), which routes only critical steps (those likely to derail the solution) to larger models while letting smaller models handle routine continuations. Our key insight is that targeted step-level interventions can fundamentally transform inference efficiency by confining expensive calls to precisely those steps where stronger models prevent cascading errors. TRIM operates at the step level: it uses process reward models to identify erroneous steps and makes routing decisions based on step-level uncertainty and budget constraints. We develop several routing strategies within TRIM, ranging from a simple threshold-based policy to more expressive policies that reason about long-horizon accuracy-cost trade-offs and uncertainty in step-level correctness estimates. On MATH-500, even the simplest thresholding strategy surpasses prior routing methods with 5x higher cost efficiency, while more advanced policies match the strong, expensive model's performance using 80% fewer expensive-model tokens. On harder benchmarks such as AIME, TRIM achieves up to 6x higher cost efficiency. All methods generalize effectively across math reasoning tasks, demonstrating that step-level difficulty reflects fundamental characteristics of reasoning.
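To make the simplest strategy concrete, the following is a minimal sketch of a threshold-based step router, written under stated assumptions and not taken from the paper's implementation. The names `threshold_route`, `RoutingTrace`, `small_model`, `large_model`, `prm_score`, and the `"ANSWER"` stop marker are all hypothetical. The sketch assumes that a low process-reward score on a small-model step triggers regeneration of that step by the larger model. Budget constraints and the more expressive long-horizon policies are not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interfaces (assumptions, not the paper's API):
#   a step generator maps (question, steps so far) to the next step;
#   a step scorer maps (question, steps so far, candidate) to an
#   estimated probability that the candidate step is correct.
StepGenerator = Callable[[str, List[str]], str]
StepScorer = Callable[[str, List[str], str], float]


@dataclass
class RoutingTrace:
    steps: List[str] = field(default_factory=list)
    routed_to_large: List[bool] = field(default_factory=list)


def threshold_route(
    question: str,
    small_model: StepGenerator,
    large_model: StepGenerator,
    prm_score: StepScorer,
    threshold: float = 0.5,
    max_steps: int = 16,
) -> RoutingTrace:
    """Generate a solution step by step, escalating only low-scoring steps."""
    trace = RoutingTrace()
    for _ in range(max_steps):
        # The small model always proposes the next step first.
        candidate = small_model(question, trace.steps)
        # Escalate when the process reward model flags the step as likely wrong.
        escalate = prm_score(question, trace.steps, candidate) < threshold
        if escalate:
            # Assumption: the large model regenerates the flagged step.
            candidate = large_model(question, trace.steps)
        trace.steps.append(candidate)
        trace.routed_to_large.append(escalate)
        # Assumed stop convention: a step starting with "ANSWER" ends the solution.
        if candidate.startswith("ANSWER"):
            break
    return trace


# Toy stand-ins so the sketch runs without any real models.
def toy_small(question: str, steps: List[str]) -> str:
    if len(steps) == 1:
        return "bad step"  # the small model slips on its second step
    if len(steps) >= 2:
        return "ANSWER: 4"
    return "step 0"


def toy_large(question: str, steps: List[str]) -> str:
    return f"fixed step {len(steps)}"


def toy_prm(question: str, steps: List[str], candidate: str) -> float:
    return 0.1 if "bad" in candidate else 0.9


trace = threshold_route("What is 2 + 2?", toy_small, toy_large, toy_prm)
print(trace.steps)
print(trace.routed_to_large)
```

In this toy run only the flagged second step is sent to the larger model, which is the behavior the abstract describes: expensive calls are confined to steps likely to derail the solution. Raising `threshold` escalates more steps and trades cost for accuracy.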
