Multimodal large language models (MLLMs) perform well on visual reasoning, but they reason through text-based chain-of-thought (CoT), which provides no interpretable visual intermediates. Existing methods rely on opaque tokens or external tools and miss key properties. We propose Gen-VCoT, a framework that uses expert vision models to generate RGB images as reasoning intermediates. It has three stages: visual grounding (SAM segmentation), geometric reasoning (Marigold depth maps), and semantic reasoning (Qwen2-VL integration). An adaptive router selects the reasoning depth. In our evaluations, Gen-VCoT improves accuracy on spatial questions (by 25%) and depth questions (by 50%), but it can hurt performance on simple factual queries. On CLEVR, text CoT outperforms visual intermediates (91.2% vs. 62.5%), which indicates that the best intermediate representation depends on the task. Gen-VCoT establishes a new paradigm for interpretable multimodal reasoning.
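The three-stage design with an adaptive router could be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: the keyword-based routing rule, the cue-word sets, the `Trace` container, and the stub stage functions are all assumptions introduced here, and the stubs return placeholder strings where the real system would call SAM, Marigold, and Qwen2-VL.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical cue words. The abstract does not say how the router decides.
SPATIAL_CUES = {"left", "right", "above", "below", "between", "beside"}
DEPTH_CUES = {"closer", "closest", "farther", "farthest", "nearer", "behind", "depth"}


def route(question: str) -> List[str]:
    """Choose which stages to run (toy keyword heuristic, an assumption)."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    stages: List[str] = []
    if words & (SPATIAL_CUES | DEPTH_CUES):
        stages.append("grounding")   # spatial and depth questions need object masks
    if words & DEPTH_CUES:
        stages.append("geometry")    # only depth questions need a depth map
    stages.append("semantic")        # the final answer stage always runs
    return stages


@dataclass
class Trace:
    """Hypothetical record of the intermediates produced for one question."""
    question: str
    intermediates: Dict[str, str] = field(default_factory=dict)


# Stub stages: each returns a description where a real stage would return an image
# or an answer.
def ground(image: str, trace: Trace) -> str:
    return f"segmentation overlay of {image}"    # stand-in for SAM


def estimate_depth(image: str, trace: Trace) -> str:
    return f"depth map of {image}"               # stand-in for Marigold


def integrate(image: str, trace: Trace) -> str:
    used = ", ".join(trace.intermediates) or "no visual intermediates"
    return f"answer conditioned on: {used}"      # stand-in for Qwen2-VL


STAGES: Dict[str, Callable[[str, Trace], str]] = {
    "grounding": ground,
    "geometry": estimate_depth,
    "semantic": integrate,
}


def run(question: str, image: str) -> Trace:
    """Run only the stages selected by the router, in order."""
    trace = Trace(question)
    for name in route(question):
        trace.intermediates[name] = STAGES[name](image, trace)
    return trace
```

Under this sketch, a simple factual query such as "What color is the car?" skips the visual stages entirely, which is one way a router might avoid the reported degradation on such queries; a depth question runs all three stages and keeps each intermediate in the trace for inspection.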