Multimodal large language models (MLLMs) are prone to encoding non-factual or outdated knowledge, which, given the complexity of multimodal knowledge, can surface as two distinct error types: misreading and misrecognition. Previous benchmarks have not systematically analyzed how well editing methods correct these two error types. To represent and correct them more precisely, we decompose multimodal knowledge into its visual and textual components; each error type then maps to its own editing format, which edits the corresponding component of the multimodal knowledge. We present MC-MKE, a fine-grained Multimodal Knowledge Editing benchmark emphasizing Modality Consistency. Our benchmark enables misreading and misrecognition errors to be corrected independently by editing the relevant knowledge component. We evaluate four multimodal knowledge editing methods on MC-MKE, revealing their limitations, particularly with respect to modality consistency. Our work highlights the challenges of multimodal knowledge editing and motivates further research into effective techniques for this task.
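To make the decomposition concrete, the following is a minimal, hypothetical sketch of how a multimodal fact might split into a visual component (image → entity) and a textual component (subject, relation, object), and how each error type maps to an edit of one component. The class and field names here are our own illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class VisualKnowledge:
    """Visual component: grounding an image to an entity name."""
    image_path: str   # the input image
    entity: str       # entity the model should recognize in the image


@dataclass
class TextualKnowledge:
    """Textual component: a (subject, relation, object) fact."""
    subject: str
    relation: str
    obj: str


@dataclass
class MultimodalKnowledge:
    """A multimodal fact = visual recognition + textual fact about the entity."""
    visual: VisualKnowledge
    textual: TextualKnowledge

    def is_modality_consistent(self) -> bool:
        # Modality consistency: the entity recognized from the image must
        # equal the subject of the textual fact, so an edit to one
        # component cannot silently contradict the other.
        return self.visual.entity == self.textual.subject


# Illustrative fact (not drawn from the benchmark data).
fact = MultimodalKnowledge(
    visual=VisualKnowledge(image_path="tower.jpg", entity="Eiffel Tower"),
    textual=TextualKnowledge(subject="Eiffel Tower",
                             relation="located in", obj="Paris"),
)
assert fact.is_modality_consistent()

# Misreading error: the model retrieves the wrong object for the relation
# (e.g. answers "Rome"); the correction edits only the textual component.
fact.textual.obj = "Paris"

# Misrecognition error: the model names the wrong entity in the image;
# the correction edits the visual component, and the textual subject must
# be updated in step to preserve modality consistency.
fact.visual.entity = "Eiffel Tower"
fact.textual.subject = fact.visual.entity
assert fact.is_modality_consistent()
```

Under this framing, an editing format is simply a choice of which component the edit targets, and modality consistency is the invariant that must hold after any edit.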