Beyond Knowledge to Agency: Evaluating Expertise, Autonomy, and Integrity in Finance with CNFinBench

As large language models (LLMs) become high-privilege agents in risk-sensitive settings, they introduce systemic threats beyond hallucination: even minor compliance errors can cause critical data leaks. However, existing benchmarks focus on rule-based QA; they lack agentic execution modeling, overlook compliance drift in adversarial interactions, and rely on binary safety metrics that fail to capture behavioral degradation. To bridge these gaps, we present CNFinBench, a comprehensive benchmark spanning 29 subtasks grounded in the triad of expertise, autonomy, and integrity. It assesses domain-specific capabilities through certified regulatory corpora and professional financial tasks, reconstructs end-to-end agent workflows from requirement parsing to tool verification, and simulates multi-turn adversarial attacks that induce behavioral compliance drift. To quantify safety degradation, we introduce the Harmful Instruction Compliance Score (HICS), a multi-dimensional safety metric that integrates risk-type-specific deductions, multi-turn consistency tracking, and severity-adjusted penalty scaling based on fine-grained violation triggers. Evaluations of 22 open- and closed-source models reveal that LLMs perform well on applied tasks yet lack robust rule understanding, suffer a 15.4-point drop from single modules to full execution chains, and collapse rapidly under multi-turn attacks, with average violations surging by 172.3% in Round 2. CNFinBench is available at https://cnfinbench.opencompass.org.cn and https://github.com/VertiAIBench/CNFinBench.

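To make the three ingredients the abstract names for HICS more concrete (risk-type-specific deductions, multi-turn tracking, and severity-adjusted scaling), the following is a minimal illustrative sketch, not the paper's actual formula. The abstract does not give the HICS definition, so every name, weight, and risk category below (`BASE_DEDUCTION`, `SEVERITY_SCALE`, `hics_like_score`, the 100-point ceiling) is a hypothetical assumption chosen only for illustration. The small `percent_increase` helper shows the arithmetic behind a statement such as "violations surging by 172.3%".

```python
from dataclasses import dataclass

# Hypothetical per-risk-type base deductions (assumed, not from the paper).
BASE_DEDUCTION = {
    "data_leak": 30.0,
    "unauthorized_trade": 25.0,
    "misleading_advice": 15.0,
}

# Hypothetical severity multipliers (assumed, not from the paper).
SEVERITY_SCALE = {"low": 0.5, "medium": 1.0, "high": 1.5}


@dataclass(frozen=True)
class Violation:
    turn: int       # 1-based dialogue round in which the violation was triggered
    risk_type: str  # key into BASE_DEDUCTION
    severity: str   # key into SEVERITY_SCALE


def hics_like_score(violations, num_turns, max_score=100.0):
    """Return (final_score, per_turn_scores) for one multi-turn dialogue.

    The score starts at max_score; each violation subtracts its risk-type
    deduction scaled by severity. The running score is recorded after every
    turn so that degradation across rounds stays visible.
    """
    score = max_score
    per_turn = []
    for turn in range(1, num_turns + 1):
        for v in violations:
            if v.turn == turn:
                score -= BASE_DEDUCTION[v.risk_type] * SEVERITY_SCALE[v.severity]
        score = max(score, 0.0)  # clamp so the score never goes negative
        per_turn.append(score)
    return score, per_turn


def percent_increase(before, after):
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    dialogue = [
        Violation(turn=2, risk_type="misleading_advice", severity="low"),
        Violation(turn=3, risk_type="data_leak", severity="high"),
    ]
    final, trajectory = hics_like_score(dialogue, num_turns=3)
    print(final, trajectory)
    # A 172.3% increase means the later count is about 2.723x the earlier one.
    print(percent_increase(1.0, 2.723))
```

Unlike a binary safe/unsafe flag, a per-turn trajectory of this kind shows where in the conversation compliance starts to slip and by how much, which is the kind of behavioral degradation the abstract says binary metrics miss.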