Audio-visual continual test-time adaptation continually adapts a source audio-visual model, at test time, to unlabeled non-stationary domains in which either or both modalities may be distributionally shifted. Such shifts hamper online cross-modal learning and eventually lead to poor accuracy. While previous works have tackled this problem, we find that state-of-the-art methods suffer from catastrophic forgetting: because parameters are updated continually at test time, the model's performance drops well below that of the source model. In this work, we first show that adapting only the modality fusion layer to a target domain not only improves performance on that domain but can also improve performance on subsequent domains. Building on this strong cross-task transferability of the fusion layer's parameters, we propose $\texttt{AV-CTTA}$, a method that improves the model's test-time performance without access to any source data. Our approach uses a selective parameter retrieval mechanism that dynamically retrieves the best fusion layer parameters from a buffer using only a small batch of test data. These parameters are then integrated into the model, adapted to the current test distribution, and saved back to the buffer for future use. Extensive experiments on benchmark datasets with unimodal and bimodal corruptions show that $\texttt{AV-CTTA}$ significantly outperforms existing methods while minimizing catastrophic forgetting.
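The retrieve, adapt, and save-back loop described in the abstract can be illustrated with a small sketch. The abstract does not say how the "best" parameters are scored, how they are adapted, or how the buffer is updated, so everything below is an assumption made for illustration: the fusion layer is a toy linear map over already-fused features, candidates are ranked by mean prediction entropy on the batch, adaptation is a single entropy-minimization gradient step, and the adapted parameters are appended to the buffer. All function names (`retrieve`, `adapt`, `step`, and the helpers) are hypothetical and are not taken from the $\texttt{AV-CTTA}$ implementation.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, x):
    """Toy fusion layer: one weight row per class, applied to fused features x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def entropy(p):
    """Shannon entropy of a probability vector."""
    return -sum(v * math.log(v) for v in p if v > 0.0)


def mean_entropy(W, batch):
    """Average prediction entropy of fusion weights W on an unlabeled batch."""
    return sum(entropy(softmax(logits(W, x))) for x in batch) / len(batch)


def retrieve(buffer, batch):
    """Assumed selection rule: index of the buffered weights with lowest mean entropy."""
    return min(range(len(buffer)), key=lambda i: mean_entropy(buffer[i], batch))


def adapt(W, batch, lr=0.1):
    """Assumed adaptation rule: one gradient step that lowers mean entropy.

    For logits z and p = softmax(z), dH/dz_k = -p_k * (log p_k + H).
    Returns new weights; the input W is left unchanged.
    """
    grad = [[0.0] * len(W[0]) for _ in W]
    for x in batch:
        p = softmax(logits(W, x))
        h = entropy(p)
        for k, pk in enumerate(p):
            g = -pk * (math.log(pk) + h) if pk > 0.0 else 0.0
            for j, xj in enumerate(x):
                grad[k][j] += g * xj / len(batch)
    return [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, grad)]


def step(buffer, batch, lr=0.1):
    """Retrieve the best candidate, adapt it to the batch, and save it back."""
    idx = retrieve(buffer, batch)
    adapted = adapt(buffer[idx], batch, lr)
    buffer.append(adapted)  # assumed save-back policy: append, never overwrite
    return idx, adapted


if __name__ == "__main__":
    # Two buffered candidates: an uninformative one and a confident one.
    buffer = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]]]
    batch = [[1.0, 0.0], [0.0, 1.0]]
    idx, adapted = step(buffer, batch)
    print("retrieved candidate:", idx)
    print("buffer size after save-back:", len(buffer))
```

In a real audio-visual model the weights would be the fusion layer's tensors rather than nested lists, and the scoring and update rules would follow the paper's method rather than the entropy heuristic assumed here. The sketch only shows the control flow: score the buffered candidates on a small unlabeled batch, adapt the selected one, and store the result for later domains.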