FCMBench is the first large-scale, privacy-compliant multimodal benchmark for real-world financial credit applications, covering tasks and robustness challenges drawn from domain-specific workflows and constraints. The current version of FCMBench covers 26 certificate types, with 5198 privacy-compliant images and 13806 paired VQA samples. It evaluates models on Perception and Reasoning tasks under real-world Robustness interferences, comprising 3 foundational perception tasks, 4 credit-specific reasoning tasks that demand decision-oriented interpretation of visual evidence, and 10 real-world challenges for rigorous robustness stress testing. Moreover, FCMBench offers privacy-compliant realism with minimal leakage risk: all images are in-house, scenario-aware captures of manually synthesized templates, without using any publicly released images. We conduct extensive evaluations of 28 state-of-the-art vision-language models spanning 14 AI companies and research institutes. Among them, Gemini 3 Pro achieves the best F1 score among commercial models (65.16), and Kimi-K2.5 achieves the best score among open-source baselines (60.58). The mean and standard deviation of the scores across all tested models are 44.8 and 10.3, respectively, indicating that FCMBench is non-trivial and provides strong resolution for separating the capabilities of modern vision-language models. Robustness evaluations reveal that even top-performing models suffer notable performance degradation under the designed challenges. We have open-sourced this benchmark to advance AI research in the credit domain and to provide a domain-specific task for real-world AI applications.