Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to ask about entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in answering such questions, yet there is still no comprehensive benchmark for this task, especially for wearable scenarios. To fill this gap, we present CRAG-MM, a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and varying numbers of conversation turns. We design three tasks (single-source augmentation, multi-source augmentation, and multi-turn conversations), each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, and that state-of-the-art industry solutions reach similar quality (32%/45%), underscoring ample room for improvement. The benchmark served as the basis for KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.