As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade-off: manual benchmarks are costly, while LLM-based simulators are scalable but suffer from logic hallucination. We present AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination without sacrificing flexibility. This principle, instantiated through a three-agent framework, achieves an end-to-end success rate above 98% and a 60% human preference rate over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X-Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with more capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.
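To make the logic-narrative decoupling principle concrete, the sketch below illustrates the idea under stated assumptions: deterministic state transitions live in executable code, while an LLM only narrates verified outcomes. The scenario (a banking tool), the `DeterministicState` class, and the `llm_narrate` function are hypothetical illustrations, not the framework's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class DeterministicState:
    """Ground-truth state kept in executable code, never paraphrased by an LLM."""
    balance: float = 1000.0
    transactions: list = field(default_factory=list)

    def transfer(self, amount: float) -> bool:
        # Deterministic logic: the transfer succeeds only if funds are sufficient.
        if amount <= self.balance:
            self.balance -= amount
            self.transactions.append(("transfer", amount))
            return True
        return False


def llm_narrate(event: str, state: DeterministicState) -> str:
    """Placeholder for the generative layer: an LLM turns verified state changes
    into scenario narrative, but cannot alter the state itself."""
    return f"[LLM narration of '{event}'; balance is now {state.balance:.2f}]"


# Usage: logic is resolved in code first, then narrated by the generative layer.
state = DeterministicState()
ok = state.transfer(250.0)
print(llm_narrate("transfer approved" if ok else "transfer denied", state))
```

Because the environment's state can only change through code paths like `transfer`, the narration layer cannot hallucinate outcomes that contradict the ground truth.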