BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language

The BEIR dataset is a large, heterogeneous benchmark for Information Retrieval (IR) in zero-shot settings that has attracted considerable attention within the research community. However, BEIR and analogous datasets are predominantly restricted to English. Our objective is to establish large-scale resources for IR in Polish, thereby advancing research in this area of NLP. In this work, inspired by the mMARCO and Mr. TyDi datasets, we translated all accessible open IR datasets into Polish and introduced BEIR-PL, a new benchmark comprising 13 datasets that facilitates further development, training and evaluation of modern Polish language models for IR tasks. We evaluated and compared numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore, we publish pre-trained open IR models for the Polish language, marking a pioneering development in this field. The evaluation also revealed that BM25 achieved significantly lower scores for Polish than for English, which can be attributed to the high inflection and intricate morphological structure of the Polish language. Finally, we trained various re-ranking models to enhance BM25 retrieval and compared their performance to identify their characteristic features. To ensure accurate model comparisons, it is necessary to scrutinise individual results rather than to average across the entire benchmark. We therefore thoroughly analysed the outcomes of IR models on each individual data subset of the BEIR benchmark. The benchmark data is available at https://huggingface.co/clarin-knext.
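
The link between inflection and weak BM25 scores can be illustrated with a small sketch. The code below is not taken from the paper: it is a minimal, assumed implementation of Okapi BM25 (using the non-negative IDF variant popularised by Lucene), applied to a hypothetical three-sentence Polish corpus with whitespace tokenisation and no stemming or lemmatisation. It shows that a query for "kot" (cat, nominative) scores zero against a document that contains only the genitive form "kota", because BM25 matches exact surface forms.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    Illustrative sketch only; k1 and b are common default values, not
    settings reported in the abstract.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # exact-match only: an inflected form does not count
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus (not from BEIR-PL), tokenized on whitespace.
docs = [
    "kot śpi na kanapie".split(),   # nominative "kot"
    "nie ma kota w domu".split(),   # genitive "kota", same lemma
    "pies biega po parku".split(),  # unrelated document
]
scores = bm25_scores(["kot"], docs)
```

Here only the first document receives a positive score, although the second is also about a cat. Lemmatising both queries and documents before indexing is one common way to reduce this mismatch; whether and how the authors did so is not stated in the abstract.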

