Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their ability to memorize, recall, and reason over sustained, real-world interactions remains underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations of 20 MLLMs on MMRC reveal a drop in accuracy during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacy in updating factual knowledge, accumulated assumption of error propagation, and reluctance to say no. To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which records key information from the conversation and reminds the model of it when responding, enhancing its conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.
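The note-taking idea above can be illustrated with a minimal sketch: accumulate key facts as the conversation proceeds, then prepend them to each prompt so the model is reminded before answering. The `extract_key_facts` heuristic and the class/method names here are hypothetical stand-ins for whatever extraction mechanism (e.g., an auxiliary LLM call) the actual strategy uses; this is not the paper's implementation.

```python
def extract_key_facts(turn: str) -> list[str]:
    # Toy stand-in for real key-information extraction: keep sentences
    # that look like factual statements (contain " is " or " are ").
    return [s.strip() for s in turn.split(".")
            if " is " in s or " are " in s]


class NoteTakingChat:
    """Minimal sketch of a note-taking conversation wrapper."""

    def __init__(self) -> None:
        self.notes: list[str] = []

    def record(self, user_turn: str) -> None:
        # Record key information from this turn of the conversation.
        self.notes.extend(extract_key_facts(user_turn))

    def build_prompt(self, question: str) -> str:
        # Remind the model of the recorded notes before it responds.
        header = "\n".join(f"- {n}" for n in self.notes)
        return f"Notes so far:\n{header}\n\nQuestion: {question}"


chat = NoteTakingChat()
chat.record("My dog is named Rex. We walked in the park yesterday.")
prompt = chat.build_prompt("What is my dog's name?")
```

In a full system, `build_prompt`'s output would be sent to the MLLM each turn, so facts stated many turns earlier survive in the context even when the raw history is truncated.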