Comic-based visual question answering (CVQA) poses distinct challenges to multimodal large language models (MLLMs) due to its reliance on symbolic abstraction, narrative logic, and humor, properties that set it apart from conventional VQA tasks. Although Chain-of-Thought (CoT) prompting is widely used to enhance MLLM reasoning, surprisingly, its direct application to CVQA often degrades performance, especially for small-scale models. Our theoretical and empirical analyses reveal that standard CoT in CVQA suffers from state entanglement, spurious transitions, and exploration inefficiency, with small models particularly vulnerable in resource-constrained settings. To address these issues, we propose a novel comic reasoning framework designed to produce more faithful and transferable reasoning chains in small MLLMs. Specifically, our framework combines modular CoT generation with GRPO-based reinforcement fine-tuning and a novel structured reward. Beyond comic VQA, we further evaluate our approach on a broader class of humor-centric and abstract visual reasoning tasks, including meme understanding and editorial cartoon interpretation. Across five challenging benchmarks, our 3B model outperforms state-of-the-art methods, and plug-in experiments yield an additional average improvement of $\mathbf{12.1\%}$ across different MLLMs.