In this paper, we analyze the capabilities of the multilingual Dense Passage Retriever (mDPR) for extremely low-resource languages. Within the Cross-lingual Open-Retrieval Answer Generation (CORA) pipeline, mDPR achieves strong results on multilingual open QA benchmarks across 26 languages, 9 of which were unseen during training. These results are promising for question answering (QA) in low-resource languages. We focus on two extremely low-resource languages on which mDPR performs poorly: Amharic and Khmer. We collect and curate datasets to train mDPR models using Translation Language Modeling (TLM) and question–passage alignment, and we investigate the effect of our extension on the language distribution of the retrieval results. Our results on the MKQA and AmQA datasets show that language alignment improves mDPR for these low-resource languages, but the improvements are modest and overall performance remains low. We conclude that fulfilling CORA's promise of multilingual open QA in extremely low-resource settings is challenging because the model, the data, and the evaluation approach are intertwined; all three therefore need attention in follow-up work. We release our code for reproducibility and future work: https://anonymous.4open.science/r/Question-Answering-for-Low-Resource-Languages-B13C/