Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
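The abstract describes scoring each retrieved memory item by combining source credibility, temporal decay, and conflict-aware consensus, then abstaining when support is insufficient. A minimal sketch of such a scoring rule, assuming a multiplicative combination with an exponential half-life decay and Laplace-smoothed agree/conflict consensus (all names, weights, and thresholds here are illustrative, not the paper's actual formulation):

```python
from dataclasses import dataclass

# Hypothetical sketch of reliability-weighted retrieval with abstention.
# The combination rule and constants are assumptions for illustration only.

@dataclass
class MemoryItem:
    content: str
    source_credibility: float  # in [0, 1]
    age_days: float            # time since the item was written
    agree: int                 # retrieved items supporting this one
    conflict: int              # retrieved items contradicting this one

def reliability(item: MemoryItem, half_life_days: float = 30.0) -> float:
    """Combine credibility, temporal decay, and conflict-aware consensus."""
    decay = 0.5 ** (item.age_days / half_life_days)          # temporal decay
    total = item.agree + item.conflict
    consensus = (item.agree + 1) / (total + 2)               # Laplace-smoothed
    return item.source_credibility * decay * consensus

def answer_or_abstain(items: list[MemoryItem], tau: float = 0.3):
    """Return the best-supported item's content, or None to abstain."""
    scores = [reliability(it) for it in items]
    best = max(scores, default=0.0)
    if best < tau:
        return None  # support insufficient: abstain rather than guess
    return items[scores.index(best)].content
```

Under this sketch, a fresh high-credibility item with majority agreement is answered, while a stale or heavily contradicted item falls below the abstention threshold.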