Vision Large Language Models (VLLMs) have improved multi-modal understanding and visual question answering (VQA), but they still suffer from hallucinated answers. Multi-modal Retrieval-Augmented Generation (RAG) helps mitigate these issues by incorporating external information, yet challenges remain in visual context comprehension, multi-source retrieval, and multi-turn interactions. To address these challenges, Meta constructed the CRAG-MM benchmark and launched the CRAG-MM Challenge at KDD Cup 2025, which consists of three tasks. This paper describes the BlackPearl team's solutions to all three tasks of the Meta KDD Cup'25. We use a single model for each task, with key methods including data augmentation, RAG, reranking, and multi-task fine-tuning. Our solutions rank 3rd, 3rd, and 1st on the three tasks under automatic evaluation, and win second place on Task 3 after human evaluation.