A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final-answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.
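
To make the idea of plan-conditioned retrieval concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hand-written plan in place of the agent's planner and a simple token-overlap scorer in place of a real multimodal retriever; the names `PlanStep`, `retrieve`, and `plan_conditioned_retrieve`, as well as the example corpus, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlanStep:
    """One step of a reasoning plan (hypothetical structure)."""
    goal: str           # what this reasoning step should establish
    evidence_need: str  # what kind of evidence the step requires


def tokenize(text):
    """Lowercase whitespace tokenization; a stand-in for a real encoder."""
    return set(text.lower().split())


def retrieve(query_text, corpus, k=1):
    """Rank passages by token overlap with the query text (toy scorer)."""
    q = tokenize(query_text)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def plan_conditioned_retrieve(plan, corpus, k=1):
    """Run one retrieval per plan step, keyed on that step's goal and evidence need."""
    return {
        step.goal: retrieve(step.goal + " " + step.evidence_need, corpus, k)
        for step in plan
    }


# Toy knowledge base (illustrative text, not taken from SemArt or Artpedia).
corpus = [
    "baroque painters used strong chiaroscuro lighting contrast",
    "the skull is a vanitas symbol of mortality",
    "oil paint on canvas dries slowly",
]

# Hand-written plan; in the described framework a planner would produce this.
plan = [
    PlanStep("identify style", "baroque chiaroscuro lighting"),
    PlanStep("interpret symbol", "skull vanitas symbol mortality"),
]

evidence = plan_conditioned_retrieve(plan, corpus)
for goal, passages in evidence.items():
    print(goal, "->", passages)
```

Because each step issues its own query, the evidence returned for one step can be inspected independently of the others, which is the property the abstract associates with step-wise, grounded explanations.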
