Current state-of-the-art dialogue systems rely heavily on extensive training datasets. However, challenges arise in domains where domain-specific training data are insufficient or entirely absent. To tackle this challenge, we propose a novel data \textbf{A}ugmentation framework for \textbf{M}ulti-\textbf{D}omain \textbf{D}ialogue \textbf{G}eneration, referred to as \textbf{AMD$^2$G}. The AMD$^2$G framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training. We posit that domain corpora are a blend of domain-agnostic and domain-specific features, with certain representation patterns shared among diverse domains. Domain-agnostic training aims to enable models to learn these common expressive patterns. To construct domain-agnostic dialogue corpora, we employ a \textit{\textbf{de-domaining}} data processing technique that removes domain-specific features. By mitigating the effects of domain-specific features, the model trained on the de-domained corpora can effectively learn the expression patterns common to different domains. Subsequently, we adapt the learned domain-agnostic features to the target domain through domain adaptation training. We conduct experiments on Chinese dialogue datasets from five different domains and show that AMD$^2$G achieves superior performance compared to both direct training on the target-domain corpus and collective training on all five domain corpora. Our work underscores AMD$^2$G as a viable alternative solution for low-resource multi-domain dialogue generation. Code and data associated with our work are available in our GitHub repository$^{\text 1}$.
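The \textit{de-domaining} step described above can be illustrated with a minimal sketch: domain-specific mentions in an utterance are masked with a generic placeholder so that only domain-agnostic expression patterns remain. The lexicon, placeholder token, and function names below are illustrative assumptions, not the paper's actual resources.

```python
import re

# Hypothetical domain-specific lexicon; in practice this would be built
# from the source corpora (an assumption for illustration only).
DOMAIN_LEXICON = {
    "film": ["导演", "票房"],
    "medical": ["处方", "症状"],
}

def de_domain(utterance: str, domain: str, placeholder: str = "[MASK]") -> str:
    """Mask domain-specific terms, keeping the utterance's general structure."""
    for term in DOMAIN_LEXICON.get(domain, []):
        utterance = re.sub(re.escape(term), placeholder, utterance)
    return utterance

# The masked utterance retains the common expressive pattern
# while the film-specific terms are hidden.
print(de_domain("这位导演的新片票房很高", "film"))
```

A model trained on such masked corpora sees only the shared surface patterns; domain adaptation training then re-introduces target-domain specifics.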