The success of Large Language Models (LLMs) has motivated a shift toward generative approaches to retrieval and ranking, aiming to supersede classical Dual Encoders (DEs) and Cross Encoders (CEs). A prominent paradigm is pointwise Autoregressive Ranking (ARR), where an LLM generates document identifiers (docIDs) token-by-token, enabling ranking via beam search. ARR promises greater expressivity than DEs while avoiding the prohibitive computational cost of CEs. However, a formal theoretical foundation for this expressive power has been missing. Moreover, the standard next-token prediction loss is rank-agnostic, making it ill-suited to finetuning an LLM for ranking tasks. In this paper, we first prove that the expressive capacity of ARR is strictly superior to that of DEs: while a DE requires an embedding dimension that grows linearly with corpus size to realize arbitrary rankings, ARR can do so with a constant hidden dimension. We then propose SToICaL (Simple Token-Item Calibrated Loss), a generalized rank-aware training loss for LLM finetuning. Through item-level reweighting and prefix-tree marginalization, SToICaL distributes probability mass over valid docID tokens according to their ground-truth relevance. Experiments on the WordNet and ESCI datasets verify that our loss suppresses invalid docID generations and significantly improves ranking metrics beyond top-1 retrieval.
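The two ingredients named above can be sketched concretely. The toy below is an illustrative stand-in, not the paper's exact formulation: it builds a prefix tree over a hypothetical docID vocabulary, renormalizes each decoding step over the valid next tokens only (prefix-tree marginalization), and then computes a cross-entropy between relevance-derived item targets and the resulting docID probabilities (item-level reweighting). All names and the toy zero-score "model" are assumptions for illustration.

```python
import math
from collections import defaultdict

def build_prefix_tree(docids):
    """Map each docID prefix to the set of valid next tokens (trie edges)."""
    tree = defaultdict(set)
    for toks in docids:
        for i in range(len(toks)):
            tree[tuple(toks[:i])].add(toks[i])
    return tree

def constrained_log_prob(logits, toks, tree):
    """Log-prob of one docID, renormalizing each step over valid tokens only."""
    lp = 0.0
    for i, t in enumerate(toks):
        prefix = tuple(toks[:i])
        valid = tree[prefix]
        scores = logits[prefix]            # model scores conditioned on prefix
        z = sum(math.exp(scores[v]) for v in valid)  # mass on valid tokens
        lp += scores[t] - math.log(z)
    return lp

def rank_aware_loss(docids, relevances, logits, tree):
    """Cross-entropy between relevance-derived targets and docID log-probs."""
    lps = [constrained_log_prob(logits, d, tree) for d in docids]
    z_rel = sum(relevances)
    targets = [r / z_rel for r in relevances]   # item-level reweighting
    return -sum(t * lp for t, lp in zip(targets, lps))

# Hypothetical corpus: three docIDs as token sequences, with graded relevance.
docids = [("d", "0", "1"), ("d", "0", "2"), ("d", "1", "0")]
tree = build_prefix_tree(docids)
# Toy "model": uniform (zero) scores for every prefix/token pair.
logits = defaultdict(lambda: defaultdict(float))
loss = rank_aware_loss(docids, [3.0, 1.0, 0.0], logits, tree)
```

Because each step renormalizes over the prefix tree's valid continuations, invalid docIDs receive zero probability by construction, and the graded targets spread supervision over all relevant items rather than only the top-1 label.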