Audio-visual video recognition (AVVR) aims to integrate audio and visual cues to categorize videos accurately. Existing methods train AVVR models on the provided datasets and achieve satisfactory results, but they struggle to retain knowledge of previously learned classes when confronted with new classes in real-world situations. No dedicated method currently addresses this problem, so this paper explores Class Incremental Audio-Visual Video Recognition (CIAVVR). In CIAVVR, both the stored data and the learned model of past classes contain historical knowledge, so the core challenge is how to capture past data knowledge and past model knowledge to prevent catastrophic forgetting. We introduce Hierarchical Augmentation and Distillation (HAD), which comprises the Hierarchical Augmentation Module (HAM) and the Hierarchical Distillation Module (HDM) to efficiently utilize the hierarchical structure of data and models, respectively. Specifically, HAM implements a novel augmentation strategy, segmental feature augmentation, to preserve hierarchical model knowledge. Meanwhile, HDM introduces newly designed hierarchical (video-distribution) logical distillation and hierarchical (snippet-video) correlative distillation to capture and maintain the hierarchical intra-sample knowledge within each sample and the hierarchical inter-sample knowledge between samples, respectively. Evaluations on four benchmarks (AVE, AVK-100, AVK-200, and AVK-400) demonstrate that the proposed HAD effectively captures hierarchical information in both data and models, resulting in better preservation of historical class knowledge and improved performance. Furthermore, we provide a theoretical analysis to support the necessity of the segmental feature augmentation strategy.