Currently, little research has been done on knowledge editing for Large Vision-Language Models (LVLMs). Editing LVLMs faces the challenge of effectively integrating diverse modalities (image and text) while ensuring coherent and contextually relevant modifications. An existing benchmark uses three metrics (Reliability, Locality, and Generality) to measure knowledge editing for LVLMs. However, that benchmark falls short in the quality of the generated images used for evaluation, and it cannot assess whether models effectively use edited knowledge in relation to the associated content. We adopt different data collection methods to construct a new benchmark, $\textbf{KEBench}$, and add a new metric (Portability) for a more comprehensive evaluation. Because we leverage a multimodal knowledge graph, our image data exhibits clear directionality towards entities. This directionality can be further used to extract entity-related knowledge and form editing data. We conduct experiments with different editing methods on five LVLMs and thoroughly analyze how these methods affect the models. The results reveal the strengths and deficiencies of these methods and, we hope, provide insights into potential avenues for future research.