This paper describes the solutions of the Dianping-Trust-Safety team for the META CRAG-MM challenge. The challenge requires building a comprehensive retrieval-augmented generation system capable of multi-modal, multi-turn question answering. The competition consists of three tasks: (1) answering questions using structured data retrieved from an image-based mock knowledge graph, (2) synthesizing information from both knowledge graphs and web search results, and (3) handling multi-turn conversations that require context understanding and information aggregation from multiple sources. For Task 1, our solution builds on a vision large language model, enhanced by supervised fine-tuning with knowledge distilled from GPT-4.1. We further applied curriculum learning strategies to guide reinforcement learning, improving answer accuracy and reducing hallucination. For Tasks 2 and 3, we additionally leveraged web search APIs to incorporate external knowledge, enabling the system to better handle complex queries and multi-turn conversations. Our approach achieved 1st place in Task 1 with a significant lead of 52.38%, and 3rd place in Task 3, demonstrating the effectiveness of integrating curriculum learning with reinforcement learning in our training pipeline.
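To make the curriculum-learning-guided reinforcement learning idea concrete, the toy sketch below orders training samples from easy to hard before applying an RL-style reward. The difficulty heuristic, the reward values, and all function names are hypothetical illustrations, not the team's actual pipeline; in particular, the zero reward for abstaining mirrors the general idea of discouraging hallucinated answers.

```python
# Hypothetical sketch of curriculum-scheduled RL-style training.
# The difficulty heuristic and reward shaping here are illustrative
# assumptions, not the paper's actual implementation.

def difficulty(sample):
    # Assumption: longer questions are treated as harder.
    return len(sample["question"])

def curriculum_batches(samples, n_stages=3):
    """Yield batches from easiest to hardest (a simple curriculum schedule)."""
    ordered = sorted(samples, key=difficulty)
    stage = max(1, len(ordered) // n_stages)
    for i in range(0, len(ordered), stage):
        yield ordered[i:i + stage]

def reward(answer, gold):
    # Placeholder reward: +1 for a correct answer, -1 for a wrong guess,
    # 0 for abstaining ("I don't know"), which penalizes hallucination
    # relative to admitting uncertainty.
    if answer == "I don't know":
        return 0.0
    return 1.0 if answer == gold else -1.0

if __name__ == "__main__":
    samples = [
        {"question": "Capital of France?", "gold": "Paris"},
        {"question": "Which landmark is shown in this street photo?", "gold": "Eiffel Tower"},
        {"question": "Q?", "gold": "A"},
    ]
    for stage_idx, batch in enumerate(curriculum_batches(samples)):
        for s in batch:
            # Stand-in "policy" that echoes the gold answer, for demonstration.
            print(stage_idx, reward(s["gold"], s["gold"]))
```

In a real pipeline the reward would feed a policy-gradient update of the vision language model; the point of the schedule is that early RL stages see only easy samples, stabilizing training before harder multi-source questions are introduced.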