Data augmentation can effectively improve model performance in conversational question answering (CQA). However, single-turn datasets yield limited gains in CQA because of the distribution gap between single-turn and multi-turn datasets. As a result, although numerous single-turn datasets are available, they have not been used effectively. To address this problem, we propose a novel method that converts single-turn datasets into multi-turn datasets. The method consists of three components: a QA pair Generator, a QA pair Reassembler, and a question Rewriter. Given a sample consisting of a context and single-turn QA pairs, the Generator produces candidate QA pairs and a knowledge graph based on the context. The Reassembler uses the knowledge graph to obtain sequential QA pairs, and the Rewriter rewrites the questions from a conversational perspective, yielding the multi-turn dataset S2M. Our experiments show that the method can synthesize effective training resources for CQA. Notably, S2M ranked first on the QuAC leaderboard at the time of submission (Aug 24th, 2022).
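As a rough illustration of the three-stage flow described in the abstract, the sketch below wires a Generator, a Reassembler, and a Rewriter together on toy data. It is not the authors' implementation: the `QAPair` class, its `entity` field, the function names, the sentence co-occurrence "knowledge graph", and the pronoun substitution are all hypothetical stand-ins for what would be trained models in practice.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str
    entity: str  # hypothetical field: the entity the question asks about


def generate(context, seed_pairs):
    """Stand-in Generator: return candidate pairs and a toy knowledge graph.

    Assumption: candidates are just the seed pairs, and two entities are
    linked when they appear in the same sentence of the context.
    """
    entities = {p.entity for p in seed_pairs}
    graph = {e: set() for e in entities}
    for sentence in context.split("."):
        present = [e for e in entities if e in sentence]
        for a in present:
            graph[a].update(b for b in present if b != a)
    return list(seed_pairs), graph


def reassemble(pairs, graph):
    """Stand-in Reassembler: greedily order pairs so that the next question
    stays on the same entity if possible, else moves to a linked entity."""
    if not pairs:
        return []
    remaining = list(pairs)
    ordered = [remaining.pop(0)]
    while remaining:
        prev = ordered[-1].entity

        def rank(p):
            if p.entity == prev:
                return 0
            if p.entity in graph.get(prev, set()):
                return 1
            return 2

        nxt = min(remaining, key=rank)  # first pair with the lowest rank
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered


def rewrite(ordered, pronoun="they"):
    """Stand-in Rewriter: if a question repeats the previous turn's entity,
    replace the entity mention with a pronoun."""
    turns = []
    prev_entity = None
    for p in ordered:
        question = p.question
        if p.entity == prev_entity:
            question = question.replace(p.entity, pronoun)
        turns.append(QAPair(question, p.answer, p.entity))
        prev_entity = p.entity
    return turns


def single_to_multi(context, seed_pairs, pronoun="they"):
    """Run the three toy stages in sequence."""
    candidates, graph = generate(context, seed_pairs)
    return rewrite(reassemble(candidates, graph), pronoun=pronoun)


if __name__ == "__main__":
    context = (
        "Marie Curie was born in Warsaw. "
        "Marie Curie won the Nobel Prize in 1903 with Pierre Curie. "
        "Pierre Curie was a physicist."
    )
    pairs = [
        QAPair("Where was Marie Curie born?", "Warsaw", "Marie Curie"),
        QAPair("Who was Pierre Curie?", "a physicist", "Pierre Curie"),
        QAPair("When did Marie Curie win the Nobel Prize?", "1903", "Marie Curie"),
    ]
    for turn in single_to_multi(context, pairs, pronoun="she"):
        print(turn.question, "->", turn.answer)
```

In this toy run the two questions about the same entity are placed next to each other, and the second one is rewritten with a pronoun so that it depends on the previous turn, which is the property that distinguishes multi-turn from single-turn data.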