Models for conversational question answering (ConvQA) over knowledge graphs (KGs) are usually trained and tested on benchmarks of gold QA pairs. This implies that training is limited to the surface forms seen in the respective datasets, and that evaluation covers only a small set of held-out questions. With our proposed framework REIGN, we take several steps to remedy this restricted learning setup. First, we systematically generate reformulations of training questions to make models more robust to surface form variations. This is particularly challenging, given the incomplete nature of such conversational questions. Second, we guide ConvQA models towards higher performance by feeding them only those reformulations that improve their answering quality, selected using deep reinforcement learning. Third, we demonstrate the viability of training major model components on one benchmark and applying them zero-shot to another. Finally, for a rigorous evaluation of the robustness of trained models, we use and release large numbers of diverse reformulations of the benchmark test sets, generated by prompting GPT (resulting in a 20x increase in test set size). Our findings show that ConvQA models with robust training via reformulations significantly outperform those with standard training on gold QA pairs only.
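To make the second step more concrete, the following is a minimal sketch of reward-guided selection of reformulations. It is not the REIGN implementation: the abstract states only that deep reinforcement learning is used, whereas this sketch substitutes a small tabular softmax policy trained with a REINFORCE-style update. The category names in `ACTIONS`, the function `toy_reward`, and all numeric values are hypothetical placeholders; in a real setup the reward would be the measured change in answering quality of the ConvQA model.

```python
import math
import random

# Hypothetical reformulation categories; the abstract does not enumerate them.
ACTIONS = ["keep_original", "insert_entity", "delete_entity_type", "substitute_synonym"]

# Hypothetical average effect of each category on answering quality.
TOY_MEAN_GAIN = {
    "keep_original": 0.0,
    "insert_entity": 0.3,
    "delete_entity_type": -0.2,
    "substitute_synonym": 0.1,
}


def softmax(prefs):
    """Turn preference scores into a probability distribution."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(action, rng):
    """Stand-in for the change in answering quality (assumed, noisy)."""
    return TOY_MEAN_GAIN[action] + rng.gauss(0.0, 0.05)


def train_selector(steps=2000, lr=0.1, seed=0):
    """Learn which reformulation category to feed the QA model."""
    rng = random.Random(seed)
    prefs = [0.0] * len(ACTIONS)
    baseline = 0.0  # running mean reward, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        idx = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        reward = toy_reward(ACTIONS[idx], rng)
        baseline += (reward - baseline) / t
        advantage = reward - baseline
        # Policy-gradient step: the gradient of log pi(idx) with respect to
        # preference i is (1 - p_i) for the chosen action and -p_i otherwise.
        for i in range(len(prefs)):
            indicator = 1.0 if i == idx else 0.0
            prefs[i] += lr * advantage * (indicator - probs[i])
    return softmax(prefs)


if __name__ == "__main__":
    for name, prob in zip(ACTIONS, train_selector()):
        print(f"{name}: {prob:.3f}")
```

Under these toy rewards, the policy shifts probability mass towards the categories that help and away from those that hurt, which mirrors the idea of feeding the model only helpful reformulations.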