Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.