Dense retrievers excel at first-stage candidate generation, but effective reranking remains out of reach in zero-resource settings. Existing approaches face a dilemma: cross-encoders deliver strong reranking quality but require costly supervised training and incur high latency, while unsupervised BM25 reranking degrades dense retrieval performance on most BEIR benchmarks. We propose DART (Dense Adaptive Reranking at Test-time), which resolves this dilemma by adapting the scoring function at inference time. For each query, the top-ranked documents serve as pseudo-positive examples and the bottom-ranked documents as pseudo-negative examples, providing noisy but readily available supervision for adapting a bilinear scoring matrix $W$ with a small number of gradient updates. We further introduce a confidence-weighted margin loss and a cross-query momentum buffer that warm-starts adaptation across queries. On six BEIR benchmarks, DART achieves a mean per-dataset relative NDCG@10 gain of +2.1% over the dense retrieval baseline with under 10 ms of additional latency per query, showing that test-time adaptation can improve zero-shot reranking across domains without labeled data.
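To make the adaptation loop concrete, the following is a minimal NumPy sketch of per-query test-time adaptation of a bilinear score $s(q, d) = q^\top W d$. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the exact loss, the form of the confidence weights, or the momentum update, so the softmax confidence weights, the pairwise hinge loss, the exponential-moving-average buffer, and all names and hyperparameters (`adapt_and_rerank`, `MomentumBuffer`, `k`, `steps`, `lr`, `margin`, `beta`) are hypothetical choices.

```python
import numpy as np


def bilinear_scores(q, D, W):
    """Return s(q, d) = q^T W d for every row d of D."""
    return D @ (W.T @ q)


def adapt_and_rerank(q, D, W0, k=3, steps=5, lr=0.1, margin=0.1):
    """Adapt W on pseudo-labels for one query, then rerank the candidates.

    q: (dim,) query embedding; D: (n, dim) candidate embeddings, n >= 2 * k.
    Returns (ranking, W): candidate indices sorted best-first, adapted matrix.
    """
    first_stage = D @ q                      # dense retriever scores
    order = np.argsort(-first_stage)
    pos_idx, neg_idx = order[:k], order[-k:]  # pseudo-positives / pseudo-negatives
    P, N = D[pos_idx], D[neg_idx]

    # ASSUMPTION: confidence = softmax over first-stage scores of the positives.
    z = first_stage[pos_idx]
    conf = np.exp(z - z.max())
    conf /= conf.sum()

    W = W0.copy()
    for _ in range(steps):
        sp = bilinear_scores(q, P, W)        # (k,)
        sn = bilinear_scores(q, N, W)        # (k,)
        # Pairs (i, j) that violate the margin: margin - s(p_i) + s(n_j) > 0.
        violated = (margin - sp[:, None] + sn[None, :]) > 0
        w = conf[:, None] * violated / k     # per-pair weight, zero if satisfied
        # Gradient of the weighted hinge loss: sum_ij w_ij * q (n_j - p_i)^T.
        direction = w.sum(axis=0) @ N - w.sum(axis=1) @ P
        W -= lr * np.outer(q, direction)

    ranking = np.argsort(-bilinear_scores(q, D, W))
    return ranking, W


class MomentumBuffer:
    """ASSUMPTION: EMA of per-query deviations from the identity matrix."""

    def __init__(self, dim, beta=0.9):
        self.beta = beta
        self.delta = np.zeros((dim, dim))

    def init_W(self):
        # Warm start for the next query: identity plus accumulated deviation.
        return np.eye(self.delta.shape[0]) + self.delta

    def update(self, W_adapted):
        dim = self.delta.shape[0]
        self.delta = self.beta * self.delta + (1 - self.beta) * (W_adapted - np.eye(dim))
```

In this sketch, each query would start from `buffer.init_W()`, call `adapt_and_rerank`, and then pass the adapted matrix to `buffer.update`. With `W0` set to the identity and `steps=0`, the ranking reduces to the first-stage dense ranking, which makes the baseline easy to recover for comparison.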