BEIR is a large, heterogeneous benchmark for zero-shot Information Retrieval (IR) that has attracted considerable attention in the research community. However, BEIR and analogous datasets are predominantly restricted to English. Our objective is to establish large-scale resources for IR in the Polish language, thereby advancing research in this area of NLP. In this work, inspired by the mMARCO and Mr.~TyDi datasets, we translated all accessible open IR datasets into Polish and introduce BEIR-PL, a new benchmark comprising 13 datasets that facilitates further development, training and evaluation of modern Polish language models for IR tasks. We evaluated and compared numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore, we publish pre-trained open IR models for the Polish language, marking a pioneering development in this field. The evaluation also revealed that BM25 achieved significantly lower scores for Polish than for English, which can be attributed to the high inflection and intricate morphological structure of the Polish language. Finally, we trained various re-ranking models to enhance BM25 retrieval, and we compared their performance to identify their characteristic features. To ensure accurate model comparisons, it is necessary to scrutinise individual results rather than to average across the entire benchmark. Thus, we thoroughly analysed the outcomes of IR models on each individual data subset encompassed by the BEIR benchmark. The benchmark data is available at URL {\bf https://huggingface.co/clarin-knext}.
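The link between inflection and weaker BM25 scores can be illustrated with a minimal sketch. It is not the evaluation setup used in this work: the tiny corpora, the whitespace tokenizer, the function names and the parameter values `k1=1.5`, `b=0.75` are all assumptions chosen for illustration. BM25 matches exact surface forms, so a query in the nominative (`kot`) does not match a document containing the accusative (`kota`) unless some form of lemmatization or stemming is applied.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenization with no stemming or lemmatization
    # (an assumption made to expose the surface-form mismatch).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # only exact surface-form matches contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpora (hypothetical, not taken from BEIR or BEIR-PL).
english_docs = ["the cat sleeps on the sofa", "the dog barks at night"]
polish_docs = ["dzieci karmią kota na podwórku", "pies szczeka w nocy"]

english_scores = bm25_scores("cat", english_docs)  # "cat" matches "cat"
polish_scores = bm25_scores("kot", polish_docs)    # "kot" does not match "kota"
```

In this toy setting the English query retrieves its relevant document, while the Polish query scores zero against both documents even though the first one is about a cat. Real systems typically reduce this effect with a lemmatizer or stemmer in the analysis chain.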