Question answering (QA) models often rely on large-scale training datasets, which motivates data generation frameworks that reduce the cost of manual annotation. Although several recent studies have generated synthetic questions with single-span answers, none has addressed the creation of list questions, whose answers consist of multiple, non-contiguous spans. To address this gap, we propose \ours, an automated framework for generating list QA datasets from unlabeled corpora. We first convert a passage from Wikipedia or PubMed into a summary and extract named entities from the summarized text as candidate answers. This allows us to select answers that are semantically correlated in context and are, therefore, suitable for constructing list questions. We then create questions using an off-the-shelf question generator, given the extracted entities and the original passage. Finally, iterative filtering and answer expansion are performed to ensure the accuracy and completeness of the answers. Using our synthetic data, we significantly improve the performance of the previous best list QA models, raising exact-match F1 by 5.0 points on MultiSpanQA, 1.9 on Quoref, and 2.8 on average across three BioASQ benchmarks.
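The pipeline described in the abstract (summarize, extract candidate answers, generate a question, then iteratively filter and expand the answers) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `generate_list_qa`, the `ListQAExample` container, and the four callables passed in (`summarize`, `extract_entities`, `generate_question`, `answer_question`) are hypothetical names standing in for whichever summarizer, named-entity recognizer, question generator, and QA model one plugs in. The abstract does not specify how filtering and expansion work; here they are assumed to be driven by a QA model's predictions, and the stopping rule is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ListQAExample:
    passage: str
    question: str
    answers: List[str]


def generate_list_qa(
    passage: str,
    summarize: Callable[[str], str],                      # hypothetical summarizer
    extract_entities: Callable[[str], List[str]],         # hypothetical NER model
    generate_question: Callable[[str, List[str]], str],   # hypothetical question generator
    answer_question: Callable[[str, str], List[str]],     # hypothetical list QA model
    max_rounds: int = 3,
) -> Optional[ListQAExample]:
    # Step 1: summarize the passage and take named entities from the summary
    # as candidate answers. Keeping only entities that occur verbatim in the
    # passage is an assumption, made so that answers stay extractable spans.
    summary = summarize(passage)
    answers = [e for e in dict.fromkeys(extract_entities(summary)) if e in passage]

    for _ in range(max_rounds):
        if len(answers) < 2:
            return None  # a list question needs at least two answers

        # Step 2: generate a question from the current answers and the
        # ORIGINAL passage (not the summary).
        question = generate_question(passage, answers)

        # Step 3 (assumed mechanism): ask a QA model the generated question.
        predicted = [
            p for p in dict.fromkeys(answer_question(passage, question))
            if p in passage
        ]
        kept = [a for a in answers if a in predicted]    # filtering
        added = [p for p in predicted if p not in kept]  # answer expansion
        revised = kept + added

        if revised == answers:
            # The answer set is stable, so the question matches the answers.
            return ListQAExample(passage, question, answers)
        answers = revised

    return None  # did not stabilize within max_rounds; discard the example
```

Passing the components in as callables keeps the sketch independent of any particular library. Discarding examples whose answer set does not stabilize is one conservative design choice; a real system might instead keep the last revision or apply a confidence threshold.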