The rapid adoption of large language models (LLMs) in financial services introduces new operational, regulatory, and security risks. Yet most red-teaming benchmarks remain domain-agnostic and fail to capture failure modes specific to regulated Banking, Financial Services, and Insurance (BFSI) settings, where harmful behavior can be elicited through legally or professionally plausible framing. We propose a risk-aware evaluation framework for LLM security failures in BFSI, combining a domain-specific taxonomy of financial harms, an automated multi-round red-teaming pipeline, and an ensemble-based judging protocol. We introduce the Risk-Adjusted Harm Score (RAHS), a risk-sensitive metric that goes beyond success rates by quantifying the operational severity of disclosures, accounting for mitigation signals, and leveraging inter-judge agreement. Across diverse models, we find that higher decoding stochasticity and sustained adaptive interaction not only increase jailbreak success, but also drive systematic escalation toward more severe and operationally actionable financial disclosures. These results expose limitations of single-turn, domain-agnostic security evaluation and motivate risk-sensitive assessment under prolonged adversarial pressure for real-world BFSI deployment.
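The abstract describes RAHS as combining three signals: per-judge operational severity, mitigation signals, and inter-judge agreement. The exact formula is not given in the abstract, so the following is a minimal illustrative sketch only; the aggregation rule, the 0.5 mitigation discount, and the spread-based agreement weight are all assumptions, not the paper's definition.

```python
def rahs(severity_scores, mitigation_flags):
    """Illustrative risk-adjusted harm score (NOT the paper's formula).

    severity_scores: per-judge operational-severity ratings in [0, 1].
    mitigation_flags: per-judge booleans, True when the judge observed a
        mitigation signal (e.g. a refusal, warning, or redaction).
    """
    # Discount severity when a mitigation signal was observed
    # (the 0.5 factor is an illustrative choice).
    adjusted = [s * (0.5 if m else 1.0)
                for s, m in zip(severity_scores, mitigation_flags)]
    mean_severity = sum(adjusted) / len(adjusted)
    # Weight by inter-judge agreement: 1 minus the spread of the
    # adjusted ratings, so disagreement attenuates the score.
    agreement = 1.0 - (max(adjusted) - min(adjusted))
    return mean_severity * agreement
```

With unanimous judges and no mitigation, `rahs([0.8, 0.8, 0.8], [False, False, False])` returns the raw mean severity of 0.8; mitigation signals or judge disagreement pull the score down.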