Data augmentation (DA) is widely used in deep speaker models because it is easy to implement and highly effective. It enriches the training data by simulating real-life acoustic variations, so that deep neural networks learn speaker-related representations while disregarding irrelevant acoustic variations, which improves robustness and generalization. However, a potential issue with vanilla DA is augmentation residual, i.e., unwanted distortion introduced by the different types of augmentation. To address this problem, this paper proposes adversarial data augmentation (A-DA), a novel approach that combines DA with adversarial learning. Specifically, it adds an augmentation classifier that categorizes the augmentation types used in data augmentation. Adversarial learning then drives the network to generate speaker embeddings that deceive the augmentation classifier, making the learned embeddings more robust to augmentation variations. Experiments on the VoxCeleb and CN-Celeb datasets demonstrate that the proposed A-DA outperforms standard DA in both augmentation-matched and augmentation-mismatched test conditions, indicating superior robustness and generalization against acoustic variations.
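As a rough illustration of the adversarial objective described above, the following minimal sketch computes the two competing losses for a single utterance. The min-max form used here (the classifier minimizes the augmentation loss, the encoder minimizes the speaker loss minus a weighted augmentation loss), the weight `lam`, the function names, and the example logits are all assumptions made for illustration; the abstract does not specify the exact formulation used in A-DA.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def adversarial_losses(spk_logits, spk_label, aug_logits, aug_label, lam=0.1):
    """Return (encoder_loss, classifier_loss) for one utterance.

    Assumed min-max form (hypothetical, not taken from the paper):
      augmentation classifier minimizes  L_aug
      speaker encoder         minimizes  L_spk - lam * L_aug
    """
    l_spk = cross_entropy(spk_logits, spk_label)
    l_aug = cross_entropy(aug_logits, aug_label)
    return l_spk - lam * l_aug, l_aug


# Hypothetical logits: 3 speakers and 4 augmentation types.
spk_logits, spk_label = [2.0, 0.5, -1.0], 0

# Classifier cannot tell the augmentation type (uniform logits).
enc_confused, clf_confused = adversarial_losses(
    spk_logits, spk_label, [0.0, 0.0, 0.0, 0.0], 2
)

# Classifier confidently identifies the augmentation type.
enc_confident, clf_confident = adversarial_losses(
    spk_logits, spk_label, [0.0, 0.0, 5.0, 0.0], 2
)
```

Under this assumed objective, the encoder's loss is lower when the augmentation classifier is confused than when it is confident, which is the pressure that pushes augmentation-specific residual out of the speaker embedding.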