Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi

Recent advances in deep learning have produced highly sophisticated systems with an insatiable appetite for data, yet building good deep learning models for low-resource languages remains a challenging task. This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi. Although Hindi is the 3rd most spoken language worldwide, with 345 million speakers, and Marathi is the 11th most spoken, with 83.2 million speakers, both languages have limited resources for building efficient Question Answering systems. To tackle this data scarcity, we have developed a novel approach for translating the SQuAD 2.0 dataset into Hindi and Marathi. We release the largest Question Answering dataset available for these languages, with each dataset containing 28,000 samples. We evaluate the dataset on various architectures and release the best-performing models for both Hindi and Marathi to facilitate further research in these languages. By leveraging similarity tools, our method can potentially be used to create datasets in diverse languages, thereby enhancing natural language understanding across varied linguistic contexts. Our fine-tuned models, code, and dataset will be made publicly available.

