Because acquiring large amounts of data for training neural dialog models is cumbersome and expensive, data augmentation has been proposed to make effective use of existing training samples. However, current data augmentation techniques for dialog generation mostly augment every case in the training dataset without considering how cases differ in their intrinsic attributes. We argue that not all cases are worth augmenting, and that cases suitable for augmentation should have two attributes: (1) low quality (the dialog model cannot generate a high-quality response for the case) and (2) representativeness (the case reflects the properties of the whole dataset). We explore this idea by proposing a Selective Data Augmentation (SDA) framework for the response generation task. SDA employs a dual adversarial network to select the lowest-quality and most representative data points for augmentation in a single stage. Extensive experiments on two publicly available datasets, DailyDialog and OpenSubtitles, show that our framework improves response generation performance across various metrics.