Models for conversational question answering (ConvQA) over knowledge graphs (KGs) are usually trained and tested on benchmarks of gold QA pairs. This implies that training is limited to the surface forms seen in the respective datasets, and that evaluation covers only a small set of held-out questions. Through our proposed framework REIGN, we take several steps to remedy this restricted learning setup. First, we systematically generate reformulations of training questions to increase the robustness of models to surface form variations. This is a particularly challenging problem, given the incomplete nature of such questions. Second, using deep reinforcement learning, we guide ConvQA models towards higher performance by feeding them only those reformulations that help improve their answering quality. Third, we demonstrate the viability of training major model components on one benchmark and applying them zero-shot to another. Finally, for a rigorous evaluation of the robustness of trained models, we use and release large numbers of diverse reformulations generated by prompting GPT for benchmark test sets (resulting in a 20x increase in size). Our findings show that ConvQA models with robust training via reformulations significantly outperform those with standard training from gold QA pairs only.