Recently, open-domain question answering systems have begun to rely heavily on annotated datasets to train neural passage retrievers. However, manually annotating such datasets is both difficult and time-consuming, which limits their availability for less popular languages. In this work, we experiment with several methods for automatically collecting weakly labeled datasets and show how they affect the performance of the neural passage retrieval models. As a result of our work, we publish the MAUPQA dataset, consisting of nearly 400,000 question-passage pairs for Polish, as well as the HerBERT-QA neural retriever.