Neural networks have recently made impressive progress across diverse fields, and speech processing is no exception. However, recent breakthroughs in this area require extensive offline training on large datasets with tremendous computing resources. Unfortunately, these models struggle to retain previously acquired knowledge when they learn new tasks continually, and retraining from scratch is almost always impractical. In this paper, we investigate the problem of learning sequence-to-sequence models for spoken language understanding (SLU) in a class-incremental learning (CIL) setting, and we propose COCONUT, a CIL method that combines experience replay with contrastive learning. Through a modified version of the standard supervised contrastive loss applied only to the rehearsal samples, COCONUT preserves the learned representations by pulling samples from the same class closer together and pushing the others away. Moreover, we leverage a multimodal contrastive loss that helps the model learn more discriminative representations of the new data by aligning audio and text features. We also investigate different contrastive designs to combine the strengths of the contrastive loss with the teacher-student architectures used for distillation. Experiments on two established SLU datasets demonstrate the effectiveness of our proposed approach, with significant improvements over the baselines. We also show that COCONUT can be combined with methods that operate on the decoder side of the model, resulting in further metric improvements.
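To make the two contrastive objectives mentioned above concrete, the following is a minimal sketch in plain Python. It shows the standard supervised contrastive loss and a symmetric audio-text alignment loss of the InfoNCE kind. It is not the modified loss used by COCONUT, whose details the abstract does not give. The function names, the temperature value, and the toy embeddings are assumptions made for illustration, and all input vectors are assumed to be non-zero.

```python
import math


def _normalize(v):
    # Scale a vector to unit L2 norm (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _log_sum_exp(values):
    # Numerically stable log(sum(exp(v))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Standard supervised contrastive loss over L2-normalized features.

    Each anchor is pulled toward the other samples that share its label and
    pushed away from the rest. Anchors without a same-class partner are skipped.
    """
    z = [_normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        logits = {j: _dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        log_denominator = _log_sum_exp(list(logits.values()))
        total += -sum(logits[j] - log_denominator for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def audio_text_contrastive_loss(audio, text, temperature=0.1):
    """Symmetric InfoNCE-style loss: the i-th audio and i-th text embeddings
    form the positive pair; every other pairing in the batch is a negative."""
    a = [_normalize(v) for v in audio]
    t = [_normalize(v) for v in text]
    n = len(a)
    loss = 0.0
    for i in range(n):
        audio_to_text = [_dot(a[i], t[j]) / temperature for j in range(n)]
        text_to_audio = [_dot(t[i], a[j]) / temperature for j in range(n)]
        loss += _log_sum_exp(audio_to_text) - audio_to_text[i]
        loss += _log_sum_exp(text_to_audio) - text_to_audio[i]
    return loss / (2 * n)


# Toy rehearsal batch (hypothetical 2-D embeddings with two classes).
rehearsal_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
well_clustered = supervised_contrastive_loss(rehearsal_features, [0, 0, 1, 1])
badly_clustered = supervised_contrastive_loss(rehearsal_features, [0, 1, 0, 1])

# Toy paired audio/text embeddings (hypothetical values).
aligned = audio_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = audio_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch, the supervised loss is small when same-class embeddings already sit close together and large when they do not. In a replay setting, that property is what would discourage the representations of rehearsal samples from drifting. The alignment loss behaves analogously for matched versus mismatched audio-text pairs.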