BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to help researchers broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains, and evaluate 10 state-of-the-art retrieval systems, including lexical, sparse, dense, late-interaction, and re-ranking architectures, on the BEIR benchmark. Our results show that BM25 is a robust baseline and that re-ranking and late-interaction models achieve the best zero-shot performance on average, albeit at high computational cost. In contrast, dense and sparse retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems and contributes to accelerating progress towards more robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.
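
To make the lexical baseline named in the abstract concrete, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the BEIR implementation and does not use the BEIR library: the function name `bm25_scores`, the whitespace tokenization, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for illustration, and the IDF uses the non-negative variant popularized by Lucene.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 and b are commonly used defaults, assumed here for illustration only.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption: Lucene-style smoothing).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical, not drawn from any BEIR dataset).
corpus = [
    "bm25 is a lexical retrieval baseline".split(),
    "dense retrievers embed queries and documents".split(),
    "re-ranking models rescore retrieved documents".split(),
]
query = "lexical retrieval baseline".split()

scores = bm25_scores(query, corpus)
# Rank document indices from highest to lowest score.
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

Because the scoring depends only on exact term overlap, the second and third toy documents receive a score of zero for this query even though "retrieved" is related to "retrieval". This is the kind of vocabulary mismatch that dense and re-ranking models aim to overcome, at additional computational cost.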
