The goal of document-grounded dialogue (DocGD) is to generate a response that is grounded in evidence from a supporting document and consistent with the dialogue context. This process involves four variables that are causally connected. Recently, task-specific pre-training has greatly boosted performance on many downstream tasks. Existing DocGD methods, however, continue to rely on general pre-trained language models, without a tailored pre-training approach that explicitly captures these causal relationships. To address this issue, we are the first to present a causally-complete dataset construction strategy for building million-scale DocGD pre-training corpora. To better capture causality, we further propose a causally-perturbed pre-training strategy, which introduces causal perturbations on the variables and optimizes the overall causal effect. Experiments on three benchmark datasets demonstrate that our causal pre-training achieves considerable and consistent improvements under fully-supervised, low-resource, few-shot, and zero-shot settings.