This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge targets a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate, especially when faced with egocentric imagery, long-tail entities, and complex, multi-hop questions. This issue is particularly problematic in real-world applications where users pose fact-seeking queries that demand high factual accuracy across diverse modalities. To tackle this, we propose a robust, multi-stage framework that prioritizes factual accuracy and truthfulness over completeness. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, a dual-pathway generation module, and a post-hoc verification step. This conservative strategy is designed to minimize hallucinations, which incur a severe penalty under the competition's scoring metric. Our approach achieved 3rd place in Task 1, demonstrating the effectiveness of prioritizing answer reliability in complex multi-modal RAG systems. Our implementation is available at https://github.com/Breezelled/KDD-Cup-2025-Meta-CRAG-MM .
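The conservative multi-stage flow summarized above (router, retrieval and summarization, dual-pathway generation, post-hoc verification) can be sketched as follows. This is a minimal illustration only; every component name and interface here is hypothetical, not the team's released code.

```python
# Hypothetical sketch of the multi-stage pipeline described in the abstract.
# All component names (router, retriever, summarizer, generators, verifier)
# are illustrative stand-ins, not the actual CRUISE implementation.

def answer(query, image, router, retriever, summarizer, generators, verifier,
           abstain="I don't know."):
    """Conservative RAG pipeline: abstain rather than risk a hallucination."""
    # 1. Lightweight query router decides whether retrieval is needed at all.
    if not router.needs_retrieval(query, image):
        return generators["direct"](query, image)

    # 2. Query-aware retrieval and summarization condenses the evidence.
    evidence = summarizer(query, retriever.search(query, image))

    # 3. Dual-pathway generation: produce candidate answers along each pathway
    #    (e.g., evidence-grounded and parametric-knowledge pathways).
    candidates = [gen(query, image, evidence) for gen in generators.values()]

    # 4. Post-hoc verification keeps only answers supported by the evidence;
    #    if none survive, abstain, since wrong answers are penalized far more
    #    heavily than "I don't know" under the CRAG-MM scoring metric.
    for candidate in candidates:
        if verifier.supported(candidate, evidence):
            return candidate
    return abstain
```

The key design choice the sketch captures is asymmetric risk: because a hallucinated answer costs more than an abstention, the final verification gate defaults to abstaining whenever no candidate is supported by retrieved evidence.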