Large language models (LLMs) are increasingly applied in financial scenarios. However, they may produce harmful outputs, including facilitating illegal activities or unethical behavior, posing serious compliance risks. To systematically evaluate LLM safety in finance, we propose FinSafetyBench, a bilingual (English-Chinese) red-teaming benchmark designed to test an LLM's refusal of requests that violate financial compliance. Grounded in real-world financial crime cases and ethics standards, the benchmark comprises 14 subcategories spanning financial crimes and ethical violations. Through extensive experiments on general-purpose and finance-specialized LLMs under three representative attack settings, we identify critical vulnerabilities that allow adversarial prompts to bypass compliance safeguards. Further analysis reveals stronger susceptibility in Chinese contexts and highlights the limitations of prompt-level defenses against sophisticated or implicit manipulation strategies.