Chain-of-Thought reasoning has driven large language models to extend from thinking with text to thinking with images and videos. However, each of these visual modalities has clear limitations: static images struggle to represent temporal structure, while videos introduce substantial redundancy and computational cost. In this work, we propose Thinking with Comics, a visual reasoning paradigm that uses comics as a high information-density medium positioned between images and videos. Comics preserve temporal structure, embedded text, and narrative coherence at a significantly lower reasoning cost than video. We systematically study two comic-based reasoning paths and evaluate them on a range of reasoning tasks and long-context understanding tasks. Experimental results show that Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks, while remaining substantially more efficient than Thinking with Video. Further analysis indicates that different comic narrative structures and styles consistently affect performance across tasks, suggesting that comics serve as an effective intermediate visual representation for improving multimodal reasoning.