Multimodal large language models (MLLMs) have revolutionized natural language processing and visual understanding, but they often contain outdated or inaccurate information. Current multimodal knowledge editing evaluations are limited in scope and potentially biased: they focus on narrow tasks and fail to assess the impact of edits on in-domain samples. To address these issues, we introduce ComprehendEdit, a comprehensive benchmark comprising eight diverse tasks drawn from multiple datasets. We propose two novel metrics, the Knowledge Generalization Index (KGI) and the Knowledge Preservation Index (KPI), which evaluate editing effects on in-domain samples without relying on AI-synthesized samples. Based on insights from our framework, we develop Hierarchical In-Context Editing (HICE), a baseline method that employs a two-stage approach to balance performance across all metrics. This study provides a more comprehensive evaluation framework for multimodal knowledge editing, reveals unique challenges in the field, and offers a baseline method with improved performance. Our work opens new perspectives for future research and lays a foundation for developing more robust and effective editing techniques for MLLMs. The ComprehendEdit benchmark and implementation code are available at https://github.com/yaohui120/ComprehendEdit.