Recent advances in deep learning have produced highly sophisticated systems with an unquenchable appetite for data. Building good deep learning models for low-resource languages, however, remains a challenging task. This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi. Although Hindi is the third most spoken language worldwide, with 345 million speakers, and Marathi is the 11th, with 83.2 million speakers, both languages have limited resources for building efficient Question Answering systems. To tackle this data scarcity, we develop a novel approach for translating the SQuAD 2.0 dataset into Hindi and Marathi. We release the largest Question Answering dataset available for these languages, with each dataset containing 28,000 samples. We evaluate the dataset on various architectures and release the best-performing models for both Hindi and Marathi to facilitate further research in these languages. By leveraging similarity tools, our method has the potential to create datasets in other languages as well, thereby enhancing natural language understanding across varied linguistic contexts. Our fine-tuned models, code, and dataset will be made publicly available.
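As an illustration of how a similarity tool might be used when translating an extractive dataset such as SQuAD 2.0, the sketch below locates a separately translated answer inside a translated context. This is not the authors' implementation: the function name `find_answer_span`, the `min_ratio` threshold, and the choice of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example. The idea is that a translated answer rarely matches the translated context character for character, so the span is recovered by picking the most similar window of tokens.

```python
import re
from difflib import SequenceMatcher


def find_answer_span(context, answer, min_ratio=0.8):
    """Hypothetical helper: find the context span most similar to `answer`.

    Returns (char_start, char_end, score), or None when no window reaches
    `min_ratio` (for example, for an unanswerable question).
    """
    # Whitespace tokens with their character offsets in the context.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]
    answer_len = len(answer.split())
    if not tokens or answer_len == 0:
        return None

    best = None
    # Try windows one token shorter or longer than the answer, since
    # translation can add or drop a word.
    for size in range(max(1, answer_len - 1), answer_len + 2):
        for i in range(len(tokens) - size + 1):
            start, end = tokens[i][0], tokens[i + size - 1][1]
            score = SequenceMatcher(None, context[start:end], answer).ratio()
            if best is None or score > best[2]:
                best = (start, end, score)

    if best is None or best[2] < min_ratio:
        return None
    return best


context = "The Eiffel Tower is located in Paris, France."
span = find_answer_span(context, "in Paris France")
if span is not None:
    start, end, score = span
    print(context[start:end], round(score, 3))
```

The returned character offset plays the role of SQuAD's `answer_start` field. The threshold is a tunable assumption: a higher value discards more noisy alignments at the cost of a smaller dataset.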