FCMBench is the first large-scale and privacy-compliant multimodal benchmark for real-world financial credit applications, covering tasks and robustness challenges from domain specific workflows and constraints. The current version of FCMBench covers 26 certificate types, with 5198 privacy-compliant images and 13806 paired VQA samples. It evaluates models on Perception and Reasoning tasks under real-world Robustness interferences, including 3 foundational perception tasks, 4 credit-specific reasoning tasks demanding decision-oriented visual evidence interpretation, and 10 real-world challenges for rigorous robustness stress testing. Moreover, FCMBench offers privacy-compliant realism with minimal leakage risk through in-house scenario-aware captures of manually synthesized templates, without any publicly released images. We conduct extensive evaluations of 28 state-of-the-art vision-language models spanning 14 AI companies and research institutes. Among them, Gemini 3 Pro achieves the best F1 score as a commercial model (65.16), Kimi-K2.5 achieves the best score as an open-source baseline (60.58). The mean and the std. of all tested models is 44.8 and 10.3 respectively, indicating that FCMBench is non-trivial and provides strong resolution for separating modern vision-language model capabilities. Robustness evaluations reveal that even top-performing models experience notable performance degradation under the designed challenges. We have open-sourced this benchmark to advance AI research in the credit domain and provide a domain-specific task for real-world AI applications.
翻译:暂无翻译