BEIR is a large, heterogeneous benchmark for zero-shot Information Retrieval (IR) that has attracted considerable attention in the research community. However, BEIR and similar datasets are predominantly restricted to English. Our objective is to establish large-scale resources for IR in Polish, thereby advancing research in this area of NLP. In this work, inspired by the mMARCO and Mr.~TyDi datasets, we translated all accessible open IR datasets into Polish and introduced BEIR-PL, a new benchmark comprising 13 datasets that facilitates further development, training, and evaluation of modern Polish language models for IR tasks. We evaluated and compared numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore, we publish pre-trained open IR models for the Polish language, marking a pioneering development in this field. The evaluation also revealed that BM25 achieved significantly lower scores for Polish than for English, which can be attributed to the high inflection and intricate morphological structure of the Polish language. Finally, we trained various re-ranking models to enhance BM25 retrieval and compared their performance to identify their characteristic features. To ensure accurate model comparisons, it is necessary to scrutinise individual results rather than to average across the entire benchmark. Thus, we thoroughly analysed the outcomes of IR models on each individual data subset encompassed by the BEIR benchmark. The benchmark data is available at URL {\bf https://huggingface.co/clarin-knext}.