GeMQuAD: Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning

From arXiv. Accepted to the SyntheticData4ML workshop at the 37th International Conference on Neural Information Processing Systems (NeurIPS 2023), December 10-16, 2023, New Orleans, United States. https://neurips.cc/Conferences/2023

The emergence of Large Language Models (LLMs) with capabilities like In-Context Learning (ICL) has opened new possibilities for data generation across various domains while minimizing the need for extensive data collection and modeling techniques. Researchers have explored ways to use this generated synthetic data to optimize smaller student models for reduced deployment costs and lower latency in downstream tasks. However, ICL-generated data often suffers from low quality because the few examples used in ICL limit task specificity. In this paper, we propose GeMQuAD, a semi-supervised learning approach that extends the WeakDAP framework and is applied to a dataset generated through ICL with just one example in the target language, using the AlexaTM 20B Seq2Seq LLM. Through our approach, we iteratively identify high-quality data to enhance model performance, especially for low-resource multilingual settings in the context of the Extractive Question Answering task. Our framework outperforms the machine translation-augmented model by 0.22/1.68 F1/EM (Exact Match) points for Hindi and 0.82/1.37 F1/EM points for Spanish on the MLQA dataset, and it surpasses the performance of a model trained on an English-only dataset by 5.05/6.50 F1/EM points for Hindi and 3.81/3.69 F1/EM points for Spanish on the same dataset. Notably, our approach uses a pre-trained LLM for generation with no fine-tuning (FT), utilizing just a single annotated example in ICL to generate data, providing a cost-effective development process.
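The F1 and EM figures quoted above are the two standard metrics for extractive question answering. As a rough illustration of what they measure, here is a minimal sketch under simplifying assumptions: the `normalize` helper below only lowercases and splits on whitespace, whereas benchmark evaluation scripts usually also strip punctuation and apply language-specific rules. The function names are illustrative and are not taken from the paper.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Simplified normalization (assumption): lowercase, split on whitespace."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer span and a reference span."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the reference answer plus one extra token is penalized by F1 but scores zero on EM, which is why the two metrics can move by different amounts for the same model.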

