FCMBench: A Comprehensive Financial Credit Multimodal Benchmark for Real-world Applications

As multimodal AI becomes widely used for credit risk assessment and document review, there is an urgent need for a domain-specific benchmark that (1) reflects the documents and workflows specific to financial credit applications, (2) covers credit-specific understanding and real-world robustness, and (3) preserves privacy compliance without sacrificing practical utility. Here, we introduce FCMBench-V1.0, a large-scale financial credit multimodal benchmark for real-world applications that covers 18 core certificate types with 4,043 privacy-compliant images and 8,446 QA samples. The FCMBench evaluation framework spans three dimensions, Perception, Reasoning, and Robustness: 3 foundational perception tasks, 4 credit-specific reasoning tasks that require decision-oriented understanding of visual evidence, and 10 real-world acquisition artifact types for robustness stress testing. To reconcile compliance with realism, we construct all samples via a closed synthesis-capture pipeline: we manually synthesize document templates with virtual content and capture scenario-aware images in-house. This design also mitigates pre-training data leakage by avoiding web-sourced or publicly released images. FCMBench effectively discriminates performance disparities and robustness across modern vision-language models (VLMs). Extensive experiments were conducted on 23 state-of-the-art VLMs from 14 top AI companies and research institutes. Among commercial models, Gemini 3 Pro achieves the best F1 score (64.61%); among open-source baselines, Qwen3-VL-235B achieves the best score (57.27%); and our financial credit-specific model, Qfin-VL-Instruct, achieves the top overall score (64.92%). Robustness evaluations show that even top-performing models suffer noticeable performance drops under acquisition artifacts.
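The abstract reports F1 scores and performance drops under acquisition artifacts but does not spell out how either is computed. The following is a minimal sketch of one plausible reading, assuming answers are compared as multisets of extracted items and that the robustness drop is the absolute F1 difference between clean and degraded captures. The function names, the field format, and the sample values are all hypothetical and are not taken from FCMBench.

```python
from collections import Counter


def f1_score(predicted: list[str], gold: list[str]) -> float:
    """Multiset F1 between predicted and gold answer items (range 0.0 to 1.0).

    Assumption: this item-overlap definition is illustrative only; the
    benchmark's actual scoring rule is not given in the abstract.
    """
    if not predicted or not gold:
        return 0.0
    # Counter intersection keeps the minimum count of each shared item.
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def robustness_drop(clean_f1: float, degraded_f1: float) -> float:
    """Absolute F1 drop when moving from clean to degraded images."""
    return clean_f1 - degraded_f1


# Hypothetical virtual document fields (not real data, not from the benchmark).
gold = ["name:Zhang San", "id:1234", "income:8000"]
clean_pred = ["name:Zhang San", "id:1234", "income:8000"]
blurred_pred = ["name:Zhang San", "id:1284"]  # one misread field, one missing

clean = f1_score(clean_pred, gold)
blurred = f1_score(blurred_pred, gold)
drop = robustness_drop(clean, blurred)
print(f"clean F1={clean:.2f}, blurred F1={blurred:.2f}, drop={drop:.2f}")
```

Under these assumptions, a capture that causes one misread field and one missing field lowers both precision and recall, which is the kind of degradation the robustness dimension is meant to expose.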
