Modern open-domain question answering systems often rely on accurate and efficient retrieval components to find passages containing the facts necessary to answer the question. Recently, neural retrievers have gained popularity over lexical alternatives due to their superior performance. However, most of the work concerns popular languages such as English or Chinese. For others, such as Polish, few models are available. In this work, we present SilverRetriever, a neural retriever for Polish trained on a diverse collection of manually or weakly labeled datasets. SilverRetriever achieves much better results than other Polish models and is competitive with larger multilingual models. Together with the model, we open-source five new passage retrieval datasets.