Recent years have witnessed the success of question answering (QA), especially its potential as a foundation paradigm for tackling diverse NLP tasks. However, obtaining sufficient data to build an effective and stable QA system remains an open problem. To address this problem, we introduce QASnowball, an iterative bootstrapping framework for QA data augmentation that can iteratively generate large-scale, high-quality QA data from a seed set of supervised examples. Specifically, QASnowball consists of three modules: an answer extractor that extracts core phrases from unlabeled documents as candidate answers, a question generator that generates questions based on documents and candidate answers, and a QA data filter that retains only high-quality QA data. Moreover, QASnowball can be self-enhanced by reseeding the seed set to fine-tune itself across iterations, leading to continual improvements in generation quality. We conduct experiments in a high-resource English scenario and a medium-resource Chinese scenario. The results show that the data generated by QASnowball benefits QA models: (1) training models on the generated data achieves results comparable to training on supervised data, and (2) pre-training on the generated data and then fine-tuning on supervised data achieves better performance. Our code and generated data will be released to support further work.
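The loop described above (extract candidate answers, generate questions, filter, then reseed and fine-tune) can be illustrated with a minimal sketch. This is not the released QASnowball code: every name here (`QAExample`, `qa_snowball_sketch`, the `toy_*` functions) is hypothetical, the three modules are replaced by trivial rule-based stand-ins rather than trained models, and reseeding is assumed to mean fine-tuning on the seed set plus the data kept so far.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class QAExample:
    document: str
    question: str
    answer: str


def qa_snowball_sketch(
    seed: Set[QAExample],
    documents: List[str],
    extract_answers: Callable[[str], List[str]],
    generate_question: Callable[[str, str], str],
    keep: Callable[[QAExample], bool],
    fine_tune: Callable[[Set[QAExample]], None],
    iterations: int = 2,
) -> Set[QAExample]:
    """Run a toy bootstrapping loop and return the generated examples."""
    generated: Set[QAExample] = set()
    for _ in range(iterations):
        # Reseeding (assumed): adapt the modules on seed + data kept so far.
        fine_tune(seed | generated)
        for doc in documents:
            for answer in extract_answers(doc):
                question = generate_question(doc, answer)
                candidate = QAExample(doc, question, answer)
                if keep(candidate):  # the filter retains only accepted pairs
                    generated.add(candidate)
    return generated


# --- Toy stand-ins for the three modules (assumptions, not real models) ---

def toy_extract(doc: str) -> List[str]:
    # Treat capitalized tokens as candidate answers.
    return [w.strip(".,") for w in doc.split() if w[:1].isupper()]


def toy_generate(doc: str, answer: str) -> str:
    # Cloze-style "question": blank out the answer in the document.
    return doc.replace(answer, "_____")


def toy_keep(example: QAExample) -> bool:
    # Keep pairs whose answer is in the document but not leaked in the question.
    return (
        example.answer in example.document
        and example.answer not in example.question
        and len(example.answer) > 2
    )


seed_sizes: List[int] = []


def toy_fine_tune(examples: Set[QAExample]) -> None:
    # A real system would update model weights; here we only log the set size.
    seed_sizes.append(len(examples))


docs = ["Paris is the capital of France.", "Tokyo is the capital of Japan."]
seed = {QAExample("Rome is in Italy.", "_____ is in Italy.", "Rome")}
data = qa_snowball_sketch(
    seed, docs, toy_extract, toy_generate, toy_keep, toy_fine_tune
)
print(len(data), seed_sizes)  # 4 [1, 5]
```

In this toy run the second iteration is fine-tuned on five examples instead of one, which is the snowball effect in miniature. With rule-based stand-ins the generated set does not actually change between iterations; in a real system the fine-tuned modules would be expected to produce different, ideally better, candidates.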