The goal of document-grounded dialogue (DocGD) is to generate a response by grounding the evidence in a supporting document in accordance with the dialogue context. This process involves four variables that are causally connected. Recently, task-specific pre-training has greatly boosted performance on many downstream tasks. Existing DocGD methods, however, continue to rely on general pre-trained language models without a specifically tailored pre-training approach that explicitly captures these causal relationships. To tackle this issue, we are the first to present a causally-complete dataset construction strategy for building million-scale DocGD pre-training corpora. To better capture causality, we further propose a causally-perturbed pre-training strategy, which introduces causal perturbations on the variables and optimizes the overall causal effect. Experiments on three benchmark datasets demonstrate that our causal pre-training achieves considerable and consistent improvements under fully-supervised, low-resource, few-shot, and zero-shot settings.