The integration of large language models (LLMs) into autonomous agents has enabled complex tool use, yet in high-stakes domains these systems must strictly adhere to regulatory standards, not merely achieve functional correctness. Existing benchmarks, however, largely overlook implicit regulatory compliance and therefore cannot evaluate whether LLMs autonomously enforce mandatory safety constraints. To fill this gap, we introduce LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic (LTL) oracles and employs logic-guided fuzzing to synthesize valid, safety-critical traces. Building on this framework, we construct LogiSafetyBench, a benchmark of 240 human-verified tasks requiring LLMs to generate Python programs that satisfy both functional objectives and latent compliance rules. Evaluations of 13 state-of-the-art LLMs reveal that larger models, despite achieving better functional correctness, frequently prioritize task completion over safety, resulting in non-compliant behavior.
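To make the oracle-plus-fuzzing idea concrete, the sketch below gives one plausible reading of it in Python under finite-trace LTL semantics. All names here (`Atom`, `Globally`, `holds`, `fuzz_traces`, and the example "dispense requires approval" rule) are hypothetical illustrations, not the actual LogiSafetyGen API: an oracle is an LTL formula evaluated over a trace of agent states, and the fuzzer keeps only randomly sampled traces that both satisfy the rule and actually exercise it.

```python
# Hypothetical sketch of an LTL compliance oracle plus logic-guided fuzzing.
# Names and semantics are illustrative; they are not the LogiSafetyGen API.
import random
from dataclasses import dataclass
from typing import List, Set, Union

Trace = List[Set[str]]  # one set of true atomic propositions per time step


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Not:
    sub: "Formula"


@dataclass(frozen=True)
class Implies:
    lhs: "Formula"
    rhs: "Formula"


@dataclass(frozen=True)
class Globally:  # LTL 'G': the sub-formula holds at every remaining step
    sub: "Formula"


@dataclass(frozen=True)
class Next:  # LTL 'X': the sub-formula holds at the next step
    sub: "Formula"


Formula = Union[Atom, Not, Implies, Globally, Next]


def holds(phi: Formula, trace: Trace, i: int = 0) -> bool:
    """Does `phi` hold at position i of the finite trace?"""
    if isinstance(phi, Atom):
        return phi.name in trace[i]
    if isinstance(phi, Not):
        return not holds(phi.sub, trace, i)
    if isinstance(phi, Implies):
        return (not holds(phi.lhs, trace, i)) or holds(phi.rhs, trace, i)
    if isinstance(phi, Globally):
        return all(holds(phi.sub, trace, j) for j in range(i, len(trace)))
    if isinstance(phi, Next):
        # Simplification: 'X' is treated as vacuously true at the final step.
        return i + 1 >= len(trace) or holds(phi.sub, trace, i + 1)
    raise TypeError(f"unknown formula: {phi!r}")


# Example safety oracle: whenever a dispense action occurs, approval must hold.
RULE = Globally(Implies(Atom("dispense"), Atom("approved")))


def fuzz_traces(n: int, length: int, props=("dispense", "approved")) -> List[Trace]:
    """Logic-guided fuzzing, heavily simplified: sample random traces and
    retain only those the oracle accepts AND that trigger the rule, so every
    kept trace is both valid and safety-relevant."""
    kept: List[Trace] = []
    while len(kept) < n:
        trace = [{p for p in props if random.random() < 0.5} for _ in range(length)]
        if holds(RULE, trace) and any("dispense" in step for step in trace):
            kept.append(trace)
    return kept


if __name__ == "__main__":
    for t in fuzz_traces(3, 5):
        print(t)
```

In this reading, a benchmark task would pair a functional objective with such an oracle kept latent in the task description, and a candidate program passes only if its execution trace satisfies both.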