Large language models (LLMs) are increasingly applied to ranking tasks in retrieval and recommendation. Although reasoning prompting can enhance ranking utility, our preliminary exploration reveals that its benefits are inconsistent and come at substantial computational cost, suggesting that when to reason is as crucial as how to reason. To address this issue, we propose a reasoning routing framework that employs a lightweight, plug-and-play router head to decide, for each instance before generation, whether to use direct inference (Non-Think) or reasoning (Think). The router head relies solely on pre-generation signals: i) compact ranking-aware features (e.g., candidate dispersion) and ii) model-aware difficulty signals derived from a diagnostic checklist reflecting the model's estimated need for reasoning. By leveraging these features before generation, the router outputs a controllable token that determines whether to apply the Think mode. Furthermore, the router can adaptively select its operating policy along the validation Pareto frontier during deployment, enabling dynamic allocation of computational resources toward instances most likely to benefit from Think under varying system constraints. Experiments on three public ranking datasets with open-source LLMs of different scales show consistent improvements in ranking utility with reduced token consumption (e.g., +6.3\% NDCG@10 with -49.5\% tokens on MovieLens with Qwen3-4B), demonstrating reasoning routing as a practical solution to the accuracy-efficiency trade-off.
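The routing decision described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's learned router head: the feature names (`candidate_scores`, `checklist_hits`), the logistic combination, and its weights are all hypothetical, chosen only to show how pre-generation signals could map to a Think/Non-Think decision under a threshold drawn from the validation Pareto frontier.

```python
# Hypothetical sketch of pre-generation reasoning routing. The real router
# head is a learned module attached to the LLM; here we hand-set a logistic
# combination of two illustrative pre-generation features.
import math
from dataclasses import dataclass

@dataclass
class RankingInstance:
    candidate_scores: list  # preliminary relevance scores for the candidate list
    checklist_hits: int     # diagnostic-checklist items flagged as "needs reasoning"
    checklist_size: int     # total checklist items

def dispersion(scores):
    """Standard deviation of candidate scores; low dispersion suggests a harder ranking."""
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))

def route(inst, threshold):
    """Return 'Think' or 'Non-Think' using pre-generation signals only."""
    # Ranking-aware feature: inverse dispersion (tightly clustered scores -> harder).
    f_rank = 1.0 / (1.0 + dispersion(inst.candidate_scores))
    # Model-aware difficulty: fraction of checklist items indicating reasoning need.
    f_diff = inst.checklist_hits / inst.checklist_size
    # Illustrative (not learned) logistic combination of the two features.
    p_think = 1.0 / (1.0 + math.exp(-(2.0 * f_rank + 3.0 * f_diff - 2.5)))
    return "Think" if p_think >= threshold else "Non-Think"

# The threshold plays the role of the deployed operating point: sweeping it on a
# validation set traces a utility-vs-token Pareto frontier, and deployment picks
# a point on that frontier to match the current compute budget.
easy = RankingInstance(candidate_scores=[0.9, 0.5, 0.2, 0.1],
                       checklist_hits=0, checklist_size=5)
hard = RankingInstance(candidate_scores=[0.51, 0.50, 0.49, 0.50],
                       checklist_hits=4, checklist_size=5)
print(route(easy, threshold=0.5))  # well-separated candidates, no checklist flags
print(route(hard, threshold=0.5))  # near-tied candidates, most flags raised
```

Raising the threshold routes fewer instances to Think (saving tokens); lowering it spends more tokens on reasoning, which is exactly the accuracy-efficiency dial the framework exposes.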