Visual Question Answering (VQA) requires models to reason over multimodal information, combining visual and textual data. With the development of continual learning, significant progress has been made in retaining knowledge and adapting to new information in the VQA domain. However, current methods often struggle to balance knowledge retention, adaptation, and robust feature representation. To address these challenges, we propose MacVQA, a novel continual VQA framework with adaptive memory allocation and global noise filtering. MacVQA fuses visual and question information while filtering noise to ensure robust representations, and employs prototype-based memory allocation to optimize feature quality and memory usage. These designs enable MacVQA to balance knowledge acquisition, retention, and compositional generalization in continual VQA learning. Experiments on ten continual VQA tasks show that MacVQA outperforms existing baselines, achieving 43.38% average accuracy with 2.32% average forgetting on standard tasks, and 42.53% average accuracy with 3.60% average forgetting on novel composition tasks.
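To make the memory-allocation idea concrete, the sketch below shows one common form of prototype-based exemplar selection for a replay buffer: keep the samples whose embeddings lie closest to each class prototype (the mean feature vector). This is a minimal illustration under that assumption, not MacVQA's actual implementation; the names `select_exemplars` and `budget` are hypothetical.

```python
# A minimal sketch of prototype-based exemplar selection for a replay
# buffer, illustrating the general idea behind "prototype-based memory
# allocation". This is an assumption-level example, not the authors' code.
import numpy as np

def select_exemplars(features: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` samples whose embeddings lie
    closest to the class prototype (the mean feature vector), so the
    stored memory keeps the most representative examples."""
    prototype = features.mean(axis=0)                      # class prototype
    dists = np.linalg.norm(features - prototype, axis=1)   # distance to prototype
    return np.argsort(dists)[:budget]                      # nearest-sample indices

# Usage: keep the 20 most prototypical examples of one answer class.
feats = np.random.randn(500, 256).astype(np.float32)       # placeholder embeddings
memory_indices = select_exemplars(feats, budget=20)
```

Selecting by distance to the prototype keeps the buffer representative under a fixed memory budget, which is the trade-off between feature quality and memory usage that the abstract refers to.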