We present a comparative study of cross-encoder and LLM-based rerankers in the context of re-ranking effective SPLADE retrievers. We conduct a large-scale evaluation on the TREC Deep Learning datasets and on out-of-domain datasets such as BEIR and LoTTE. In the first set of experiments, we show that cross-encoder rerankers are hard to distinguish from one another when re-ranking SPLADE on MS MARCO. Observations shift in the out-of-domain scenario, where both the type of model and the number of documents to re-rank have an impact on effectiveness. Then, we focus on listwise rerankers based on Large Language Models, especially GPT-4. While GPT-4 demonstrates impressive (zero-shot) performance, we show that traditional cross-encoders remain very competitive. Overall, our findings aim to provide a more nuanced perspective on the recent excitement surrounding LLM-based rerankers, by positioning them as another factor to consider in balancing effectiveness and efficiency in search systems.