Unpredictable Safety: Domain-Dependent Compliance and the Transparency Gap in Open-Weight LLMs

We present a systematic study of domain-dependent safety behavior in open-weight LLMs: 7 standardized experiments across 7 ethical domains, testing 5 models (12B to 70B parameters) in 4,200 interactions with dual-judge validation. Using a dual-condition methodology, in which each scenario is tested under both an analytical framing (identify the harm) and an operational framing (help commit the harm), we find that compliance rates vary from 14.7% (human trafficking) to 85.7% (surveillance design), a 71-percentage-point span with non-overlapping cluster-bootstrapped 95% CIs. Trustworthy deployment requires predictable safety behavior, yet we find that compliance is highly context-dependent: the same model (Mistral Nemo 12B) provides surveillance designs in 100% of requests but assists with trafficking in only 26.7%. This unpredictability is opaque to deployers: under the technical framing bypass, harmful requests reframed as engineering problems override safety training without any external signal that refusal thresholds have shifted. Within-domain heterogeneity reaches 84.4 percentage points, meaning safety behavior cannot be predicted even at the domain level. A replication on five frontier closed models (GPT-4.1/5.2, Claude Haiku/Sonnet/Opus 4.x; n=4,163 responses), accessed via the GitHub Copilot CLI deployed-product surface, reproduces the same domain stratification, attenuated in absolute level but identical in shape, with the two low-codification domains (science fraud, surveillance) again the most permissive. These results show that current safety mechanisms lack the transparency and consistency required for trustworthy AI deployment.
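The abstract reports cluster-bootstrapped 95% CIs on compliance rates without stating the resampling unit. As a minimal sketch of how such an interval could be computed, the Python below assumes the scenario is the cluster and uses a plain percentile bootstrap. The function name `cluster_bootstrap_ci`, the record layout, and the toy data are hypothetical illustrations, not the study's code or data.

```python
import random
from collections import defaultdict


def cluster_bootstrap_ci(records, n_boot=2000, alpha=0.05, seed=0):
    """Percentile cluster bootstrap for a compliance rate.

    records: iterable of (cluster_id, complied) pairs, complied in {0, 1}.
    Assumption: cluster_id identifies a scenario; the paper may cluster
    on a different unit.
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)

    # Group the binary outcomes by cluster.
    clusters = defaultdict(list)
    for cluster_id, complied in records:
        clusters[cluster_id].append(int(complied))
    groups = list(clusters.values())
    if not groups:
        raise ValueError("no records")

    # Pooled compliance rate on the observed data.
    total = sum(len(g) for g in groups)
    point = sum(sum(g) for g in groups) / total

    estimates = []
    for _ in range(n_boot):
        # Resample whole clusters with replacement, keeping each
        # cluster's responses together.
        sample = [rng.choice(groups) for _ in groups]
        hits = sum(sum(g) for g in sample)
        size = sum(len(g) for g in sample)
        estimates.append(hits / size)

    # Percentile interval from the sorted bootstrap estimates.
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lower, upper


# Hypothetical toy data: three scenarios with 3, 2, and 2 responses.
toy = [("s1", 1), ("s1", 1), ("s1", 0),
       ("s2", 0), ("s2", 0),
       ("s3", 1), ("s3", 1)]
point, lower, upper = cluster_bootstrap_ci(toy)
print(f"compliance={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Resampling whole clusters rather than individual responses keeps correlated responses to the same scenario together, which is why cluster-level intervals are typically wider than intervals from a naive per-response bootstrap.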
