We present a systematic study of domain-dependent safety behavior in open-weight LLMs: 7 standardized experiments across 7 ethical domains, testing 5 models (12B--70B) in 4,200 interactions with dual-judge validation. Using a dual-condition methodology, in which each scenario is tested under both an analytical framing (identify the harm) and an operational framing (help commit the harm), we find that compliance rates vary from 14.7% (human trafficking) to 85.7% (surveillance design), a 71-percentage-point span with non-overlapping cluster-bootstrapped 95% CIs. Trustworthy deployment requires predictable safety behavior, yet we find that compliance is highly context-dependent: the same model (Mistral Nemo 12B) provides surveillance designs in 100% of requests but assists with trafficking in only 26.7%. This unpredictability is opaque to deployers. In the technical framing bypass, harmful requests reframed as engineering problems override safety training without any external signal that refusal thresholds have shifted. Within-domain heterogeneity reaches 84.4 percentage points, meaning safety behavior cannot be predicted even at the domain level. A replication on five frontier closed models (GPT-4.1/5.2, Claude Haiku/Sonnet/Opus 4.x; n=4,163 responses), accessed via the GitHub Copilot CLI deployed-product surface, reproduces the same domain stratification, attenuated in absolute level but identical in shape, with the two low-codification domains (science fraud, surveillance) again the most permissive. These results show that current safety mechanisms lack the transparency and consistency required for trustworthy AI deployment.