Autoregressive Ranking: Bridging the Gap Between Dual and Cross Encoders

Dual and cross encoders have long been mainstays of information retrieval (IR), but they are being challenged by the emergent capabilities of LLMs. An LLM-based approach we term pointwise generative ranking (generating the tokens of a single docID, rather than a whole list, so that ranking can be done via beam search) combines efficiency and expressivity benefits while leveraging the in-context capabilities of causal Transformers. Although there is ample evidence that pretrained LLMs are well suited to ranking, we find that the vast majority of LLM-based approaches rely on next-token prediction, a loss function that is fundamentally rank-agnostic (and especially so with pointwise supervision). In this paper, we first prove that the expressivity of pointwise generative ranking with multi-token docIDs is superior to that of dual encoders. We then propose SToICaL, a Simple Token-Item Calibrated Loss, which can incorporate rank-aware supervision at both the item and token levels within the pointwise setup. We run a suite of experiments on ranking tasks derived from WordNet (Fellbaum, 1998) and ESCI (Reddy et al., arXiv:2206.06588). Two variants of SToICaL successfully suppress the probability of invalid docID generations and improve on common ranking metrics beyond top-1 retrieval.
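The ranking-by-beam-search idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `beam_search_rank`, `toy_model`, the two-token docIDs, and the probability table are all hypothetical, and a fixed lookup table stands in for the LLM's next-token distribution. The sketch only shows how decoding a fixed number of tokens with beam search produces a list of docIDs ordered by sequence probability.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: maps (query, docID prefix) to next-token log-probabilities.
NextTokenLogProbs = Callable[[str, Tuple[str, ...]], Dict[str, float]]


def beam_search_rank(
    query: str,
    next_token_logprobs: NextTokenLogProbs,
    docid_length: int,
    beam_width: int,
) -> List[Tuple[Tuple[str, ...], float]]:
    """Return up to `beam_width` docIDs of `docid_length` tokens, best first."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    for _ in range(docid_length):
        candidates = []
        for prefix, score in beams:
            # Extend every surviving prefix by every possible next token.
            for token, logp in next_token_logprobs(query, prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the highest-scoring prefixes for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(query: str, prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Stand-in for an LLM: a fixed table of next-token probabilities (query ignored)."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return {tok: math.log(p) for tok, p in table[prefix].items()}


ranking = beam_search_rank("example query", toy_model, docid_length=2, beam_width=3)
ranked_docids = ["".join(tokens) for tokens, _ in ranking]
print(ranked_docids)  # ['bx', 'ax', 'ay']
```

In this toy table the top-ranked docID is `bx` (probability 0.36) even though `a` is the more likely first token, which is why a beam wider than one is needed to recover a ranking rather than a single greedy guess. Nothing in the sketch restricts generation to docIDs that actually exist in a corpus.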

