This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes how the better of two candidate translation models, Google Translator and Seamless M4T, was selected and evaluated. It also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results: XLM-RoBERTa-XL achieves an F1 score of 85.99 and an exact match (EM) score of 74.56. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at www.github.com/sameearif/UQA.
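The abstract names the three EATS steps but does not spell out their mechanics, so the following is only a minimal sketch of one plausible reading: wrap the answer span in anchor markers, translate the marked context, then seek the markers in the output to recover the translated answer and its offset. The `[[` / `]]` delimiters, the function names, and the `translate` callable are all assumptions made for illustration and are not taken from the paper or its repository; unanswerable SQuAD2.0 questions are not handled here.

```python
from typing import Callable, Optional, Tuple

# Assumed anchor delimiters; the markers actually used by EATS may differ.
OPEN, CLOSE = "[[", "]]"


def enclose(context: str, answer_start: int, answer_text: str) -> str:
    """Enclose: wrap the answer span in anchors before translation."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer_start does not point at answer_text")
    return context[:answer_start] + OPEN + answer_text + CLOSE + context[end:]


def seek(translated: str) -> Optional[Tuple[str, str, int]]:
    """Seek: find the anchors, strip them, return (context, answer, start)."""
    open_at = translated.find(OPEN)
    close_at = translated.find(CLOSE, open_at + len(OPEN))
    if open_at == -1 or close_at == -1:
        return None  # anchors were lost in translation; caller must decide
    answer = translated[open_at + len(OPEN):close_at]
    clean = translated[:open_at] + answer + translated[close_at + len(CLOSE):]
    return clean, answer, open_at


def eats(
    context: str,
    answer_start: int,
    answer_text: str,
    translate: Callable[[str], str],  # hypothetical stand-in for an MT system
) -> Optional[Tuple[str, str, int]]:
    """Enclose, translate, then seek the answer span in the translation."""
    return seek(translate(enclose(context, answer_start, answer_text)))


if __name__ == "__main__":
    # str.upper is a toy "translator" that happens to leave the anchors intact.
    result = eats("Paris is the capital of France.", 13, "capital", str.upper)
    print(result)
```

Returning `None` when the anchors do not survive translation leaves the policy (drop the example, retry, or fall back to another alignment method) to the caller, since the abstract does not say how such cases are treated.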
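For readers unfamiliar with the two reported metrics, the sketch below shows how exact match and token-level F1 are commonly computed for SQuAD-style span answers. It assumes plain whitespace tokenization and no Urdu-specific normalization; the paper's actual evaluation script may normalize text differently, so these functions are illustrative only.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the whitespace-tokenized answers are identical, else 0.0."""
    return float(prediction.split() == gold.split())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    if not pred_tokens or not gold_tokens:
        # SQuAD2.0-style handling of unanswerable questions:
        # full credit only when both answers are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores such as those quoted in the abstract are typically the mean of these per-question values, expressed as percentages.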