We present the Polish Information Retrieval Benchmark (PIRB), a comprehensive evaluation framework encompassing 41 text information retrieval tasks for Polish. The benchmark incorporates existing datasets as well as 10 new, previously unpublished datasets covering diverse topics such as medicine, law, business, physics, and linguistics. We conduct an extensive evaluation of over 20 dense and sparse retrieval models, including baseline models that we trained ourselves as well as other available Polish and multilingual methods. Finally, we introduce a three-step process for training highly effective language-specific retrievers, consisting of knowledge distillation, supervised fine-tuning, and building sparse-dense hybrid retrievers using a lightweight rescoring model. To validate our approach, we train new text encoders for Polish and compare their results with those of the previously evaluated methods. Our dense models outperform the best solutions available to date, and the use of hybrid methods further improves their performance.
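To make the idea of a sparse-dense hybrid concrete, the following is a minimal sketch of how candidates from a sparse and a dense retriever might be merged and rescored. It is not the rescoring model described in this work: the function names (`min_max`, `hybrid_rescore`), the fixed weights, and the use of a simple weighted sum in place of a learned rescoring model are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]; if all scores are equal, map each to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rescore(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    w_sparse: float = 0.4,  # hypothetical weight, not taken from the paper
    w_dense: float = 0.6,   # hypothetical weight, not taken from the paper
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Fuse two candidate lists with a weighted sum of normalized scores.

    A document missing from one list contributes 0.0 for that retriever.
    Ties are broken by document id so the output is deterministic.
    """
    s, d = min_max(sparse), min_max(dense)
    candidates = set(s) | set(d)
    fused = {
        doc: w_sparse * s.get(doc, 0.0) + w_dense * d.get(doc, 0.0)
        for doc in candidates
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))[:top_k]
```

In this sketch the weighted sum stands in for the rescoring step; a learned model would replace the fixed weights with a function fitted on relevance-labelled data.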