The financial domain poses substantial challenges for vision-language models (VLMs) due to specialized chart formats and knowledge-intensive reasoning requirements. However, existing financial benchmarks are largely single-turn and rely on a narrow set of question formats, limiting comprehensive evaluation in realistic application scenarios. To address this gap, we propose FinMTM, a multi-turn multimodal benchmark that expands diversity along both the data and task dimensions. On the data side, we curate and annotate 11{,}133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals, including candlestick charts, statistical plots, and report figures. On the task side, FinMTM covers single- and multiple-choice questions, multi-turn open-ended dialogues, and agent-based tasks. We further design task-specific evaluation protocols: a set-overlap scoring rule for multiple-choice questions, a weighted combination of turn-level and session-level scores for multi-turn dialogues, and a composite metric that integrates planning quality with final outcomes for agent tasks. Extensive evaluation of 22 VLMs reveals their limitations in fine-grained visual perception, long-context reasoning, and complex agent workflows.
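The two scoring rules named above can be sketched as follows. This is a minimal illustration, not the benchmark's released implementation: it assumes the set-overlap rule is a Jaccard-style ratio over option sets, and the weight `alpha` balancing turn-level and session-level scores is a hypothetical parameter.

```python
def set_overlap_score(pred, gold):
    """Jaccard-style overlap between predicted and gold option sets.

    Assumed form of the set-overlap rule for multiple-choice questions;
    the paper's exact formula may differ.
    """
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # both empty: treat as a perfect match
    return len(pred & gold) / len(pred | gold)


def dialogue_score(turn_scores, session_score, alpha=0.5):
    """Weighted combination of mean turn-level score and session-level score.

    `alpha` is a hypothetical weight; the benchmark's actual weighting
    scheme is not specified here.
    """
    turn_avg = sum(turn_scores) / len(turn_scores)
    return alpha * turn_avg + (1 - alpha) * session_score
```

Under this sketch, a prediction {A, B} against gold {B, C} scores 1/3, rewarding partial correctness while penalizing both missed and spurious options.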