Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

The deployment of large language models (LLMs) in Swiss financial and regulatory contexts demands empirical evidence of both production reliability and adversarial security, two dimensions that existing Swiss-focused evaluation frameworks do not operationalize jointly. This paper introduces Swiss-Bench 003 (SBP-003), which extends the HAAS (Helvetic AI Assessment Score) from six to eight dimensions by adding D7 (Self-Graded Reliability Proxy) and D8 (Adversarial Security). I evaluate ten frontier models on 808 Swiss-specific items in four languages (German, French, Italian, English). The items span seven Swiss-adapted benchmarks (Swiss TruthfulQA, Swiss IFEval, Swiss SimpleQA, Swiss NIAH, Swiss PII-Scope, System Prompt Leakage, and Swiss German Comprehension) and target FINMA Guidance 08/2024, the revised Federal Act on Data Protection (nDSG), and the OWASP Top 10 for LLM Applications. Self-graded D7 scores (73-94%) exceed externally judged D8 security scores (20-61%) by a wide margin, but the two dimensions use different scoring regimes and are not directly comparable. System prompt leakage resistance ranges from 24.8% to 88.2%, while PII extraction defense remains weak (14-42%) across all models. Qwen 3.5 Plus achieves the highest self-graded D7 score (94.4%), while GPT-oss 120B achieves the highest D8 score (60.7%) despite being the lowest-cost model evaluated. All evaluations are zero-shot under provider default settings; D7 is self-graded and does not constitute independently validated accuracy. I provide conceptual mapping tables relating benchmark dimensions to FINMA model validation requirements, nDSG data protection obligations, and OWASP LLM risk categories.
