MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

arXiv preprint; 14 pages, 4 figures, 11 tables. Code: https://github.com/MING-ZCH/MetaphorStar. Model & Dataset: https://huggingface.co/collections/MING-ZCH/metaphorstar

Understanding metaphor in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, improves performance on the image implication benchmarks by an average of 82.6%. Compared with more than 20 mainstream MLLMs, MetaphorStar-32B achieves state-of-the-art (SOTA) results on Multiple-Choice Questions and Open-Style Questions, and significantly outperforms the top closed-source model, Gemini-3.0-pro, on True-False Questions. Crucially, our experiments reveal that learning image implication tasks improves general understanding ability, especially complex visual reasoning. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. We open-source all model weights, datasets, and method code at https://metaphorstar.github.io.
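The abstract names TFQ-GRPO but does not describe its internals. As a rough illustration only, the sketch below shows the general idea behind GRPO-style training with a true-false question: several answers are sampled for the same prompt, each receives a reward, and rewards are normalized within the group to obtain advantages. The binary correctness reward, the function names `tf_reward` and `group_relative_advantages`, the use of population standard deviation, and the `eps` value are all assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def tf_reward(predicted: bool, label: bool) -> float:
    """Assumed reward: 1.0 if the true/false prediction matches the label, else 0.0."""
    return 1.0 if predicted == label else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    This is the group-relative normalization commonly associated with GRPO;
    eps (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one true-false question.
label = True
sampled_answers = [True, False, True, True]

rewards = [tf_reward(p, label) for p in sampled_answers]
advantages = group_relative_advantages(rewards)

for answer, r, a in zip(sampled_answers, rewards, advantages):
    print(f"answer={answer!s:5} reward={r:.1f} advantage={a:+.3f}")
```

In this toy group the single incorrect answer receives a negative advantage and the correct answers a positive one; when every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.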
