The success of Large Language Models (LLMs) has motivated a shift toward generative approaches to retrieval and ranking, aiming to supersede classical Dual Encoders (DEs) and Cross Encoders (CEs). A prominent paradigm is pointwise Autoregressive Ranking (ARR), where an LLM generates document identifiers (docIDs) token-by-token, enabling ranking via beam search. ARR promises greater expressivity than DEs while avoiding the prohibitive computational cost of CEs. However, a formal theoretical foundation for this expressive power has been missing. Moreover, the standard next-token prediction loss is rank-agnostic and thus ill-suited for finetuning an LLM for ranking. In this paper, we first prove that the expressive capacity of ARR is strictly superior to that of DEs: while a DE requires an embedding dimension that grows linearly with corpus size to realize arbitrary rankings, ARR achieves this with a constant hidden dimension. We then propose SToICaL (Simple Token-Item Calibrated Loss), a generalized rank-aware training loss for LLM finetuning. Using item-level reweighting and prefix-tree marginalization, it distributes probability mass over valid docID tokens in proportion to their ground-truth relevance. Experiments on the WordNet and ESCI datasets show that our loss suppresses invalid docID generations and significantly improves ranking metrics beyond top-1 retrieval.
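To make the loss construction concrete, here is a minimal toy sketch of the underlying idea: sum per-token log-probabilities along each valid docID path in a prefix tree (marginalizing token probabilities into item probabilities), then weight each item's negative log-likelihood by its normalized graded relevance. All names, the tiny corpus, and the stand-in "language model" below are illustrative assumptions, not the paper's implementation.

```python
import math

# Toy corpus: docID -> its token sequence (leaves of a prefix tree).
DOC_IDS = {
    "d1": ("a", "x"),
    "d2": ("a", "y"),
    "d3": ("b", "x"),
}

# Stand-in "LM": conditional next-token probabilities given a prefix.
COND = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.7, "y": 0.3},
    ("b",): {"x": 1.0},
}

def item_log_prob(tokens):
    """Marginalize to an item-level log-probability by summing the
    conditional token log-probs along the docID's prefix-tree path."""
    lp, prefix = 0.0, ()
    for t in tokens:
        lp += math.log(COND[prefix][t])
        prefix += (t,)
    return lp

def rank_aware_loss(relevance):
    """Relevance-weighted negative log-likelihood over valid docIDs only.
    `relevance` maps docID -> nonnegative graded relevance; weights are
    normalized so the targets form a distribution over items."""
    z = sum(relevance.values())
    return -sum((r / z) * item_log_prob(DOC_IDS[d])
                for d, r in relevance.items() if r > 0)

# Graded relevance for one query: d1 highly relevant, d2 partially, d3 not.
loss = rank_aware_loss({"d1": 2.0, "d2": 1.0, "d3": 0.0})
```

Because only valid docID paths contribute, probability mass placed on out-of-tree tokens is penalized implicitly, and graded relevance (rather than a single top-1 target) shapes the learned distribution.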