Modern open-domain question answering systems often rely on accurate and efficient retrieval components to find passages containing the facts needed to answer a question. Recently, neural retrievers have gained popularity over lexical alternatives due to their superior performance. However, most of this work targets widely studied languages such as English or Chinese. For others, such as Polish, few models are available. In this work, we present Silver Retriever, a neural retriever for Polish trained on a diverse collection of manually or weakly labeled datasets. Silver Retriever achieves much better results than other Polish models and is competitive with larger multilingual models. Together with the model, we open-source five new passage retrieval datasets.