Recent years have witnessed the success of question answering (QA), especially its potential as a foundation paradigm for tackling diverse NLP tasks. However, obtaining sufficient data to build an effective and stable QA system remains an open problem. To address this problem, we introduce QASnowball, an iterative bootstrapping framework for QA data augmentation that can iteratively generate large-scale, high-quality QA data from a seed set of supervised examples. Specifically, QASnowball consists of three modules: an answer extractor that extracts core phrases from unlabeled documents as candidate answers, a question generator that generates questions based on documents and candidate answers, and a QA data filter that retains only high-quality QA data. Moreover, QASnowball can enhance itself by reseeding the seed set and fine-tuning itself across iterations, leading to continual improvements in generation quality. We conduct experiments in a high-resource English scenario and a medium-resource Chinese scenario, and the results show that the data generated by QASnowball benefits QA models in two ways: (1) training models on the generated data achieves results comparable to training on supervised data, and (2) pre-training on the generated data and then fine-tuning on supervised data achieves better performance. Our code and generated data will be released to advance further work.
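The extract, generate, filter, and reseed loop described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the released implementation: `QAExample`, `snowball`, and the `toy_*` helpers are hypothetical names, the three modules are replaced by trivial rule-based stand-ins, and we assume that reseeding means adding the filtered examples back into the seed set before the next fine-tuning round.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class QAExample:
    """One (context, question, answer) triple. Hypothetical container."""
    context: str
    question: str
    answer: str


def snowball(
    seed: Iterable[QAExample],
    documents: Iterable[str],
    extract_answers: Callable[[str], List[str]],
    generate_question: Callable[[str, str], str],
    keep: Callable[[QAExample], bool],
    train: Callable[[List[QAExample]], None],
    iterations: int = 2,
) -> List[QAExample]:
    """Run the bootstrapping loop and return all newly generated examples."""
    seed_set: List[QAExample] = list(seed)
    documents = list(documents)
    seen: Set[QAExample] = set(seed_set)
    generated: List[QAExample] = []
    for _ in range(iterations):
        # Fine-tune the modules on the current seed set (assumed callback).
        train(seed_set)
        accepted: List[QAExample] = []
        for doc in documents:
            for answer in extract_answers(doc):
                question = generate_question(doc, answer)
                example = QAExample(doc, question, answer)
                # Keep only examples the filter accepts and we have not seen.
                if keep(example) and example not in seen:
                    seen.add(example)
                    accepted.append(example)
        # Reseed: accepted examples join the seed set for the next round.
        seed_set.extend(accepted)
        generated.extend(accepted)
    return generated


# Toy stand-ins for the three modules (assumptions, for illustration only).
def toy_extract_answers(doc: str) -> List[str]:
    """Treat capitalized words as candidate answers."""
    words = [w.strip(".,;:!?") for w in doc.split()]
    return [w for w in words if w and w[0].isupper()]


def toy_generate_question(doc: str, answer: str) -> str:
    """Form a cloze-style question by masking the answer in the document."""
    return doc.replace(answer, "what", 1).rstrip(".") + "?"


def toy_keep(example: QAExample) -> bool:
    """Accept examples whose answer is in the context but not the question."""
    return example.answer in example.context and example.answer not in example.question
```

In a real system each callback would wrap a trained model, and `train` would update those models, so that later iterations generate from improved modules. The toy helpers here are static and only show how data flows through the loop.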