Neural networks have recently made impressive progress across diverse fields, and speech processing is no exception. However, these breakthroughs require extensive offline training on large datasets with tremendous computing resources. Moreover, such models struggle to retain previously acquired knowledge when they learn new tasks continually, and retraining from scratch is almost always impractical. In this paper, we investigate learning sequence-to-sequence models for spoken language understanding (SLU) in a class-incremental learning (CIL) setting, and we propose COCONUT, a CIL method that combines experience replay and contrastive learning. Through a modified version of the standard supervised contrastive loss applied only to the rehearsal samples, COCONUT preserves the learned representations by pulling samples from the same class closer together and pushing the others away. Moreover, we leverage a multimodal contrastive loss that helps the model learn more discriminative representations of the new data by aligning audio and text features. We also investigate different contrastive designs that combine the strengths of the contrastive loss with the teacher-student architectures used for distillation. Experiments on two established SLU datasets demonstrate the effectiveness of our approach, with significant improvements over the baselines. We also show that COCONUT can be combined with methods that operate on the decoder side of the model, further improving the metrics.
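To make the two contrastive objectives named in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard (unmodified) supervised contrastive loss for the rehearsal samples and a symmetric, CLIP-style audio-text alignment loss for the new data; the abstract does not specify COCONUT's modification or the exact multimodal formulation. The function names, the `temperature` default, and the use of pooled per-utterance embeddings are all hypothetical choices made for illustration.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def _l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    x = np.asarray(x, dtype=np.float64)
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Standard supervised contrastive loss (assumed form, not COCONUT's
    modified version). In a replay setting, `features` would hold the
    embeddings of the rehearsal samples only."""
    z = _l2_normalize(features)
    labels = np.asarray(labels)
    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)
    # Positives: other samples in the batch that share the anchor's class.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0  # no anchor has a positive, so nothing to pull together
    # Large negative value removes self-similarity from the softmax.
    sim = np.where(self_mask, -1e9, z @ z.T / temperature)
    log_prob = _log_softmax(sim, axis=1)
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


def audio_text_contrastive_loss(audio, text, temperature=0.1):
    """Symmetric audio-text alignment loss (assumed CLIP-style form).
    Row i of `audio` and row i of `text` are treated as a matching pair."""
    a = _l2_normalize(audio)
    t = _l2_normalize(text)
    logits = a @ t.T / temperature
    idx = np.arange(a.shape[0])
    audio_to_text = -_log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_audio = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (audio_to_text + text_to_audio))
```

In a training loop these terms would typically be added, with weighting coefficients, to the usual sequence-to-sequence cross-entropy loss; the weights and the way utterance-level embeddings are pooled are design choices the abstract leaves open.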