Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offense-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and break down under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense trade-offs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in the technical substance of the request rather than stated intent. We demonstrate that this content-grounded approach resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.
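To make the tunable offense-defense trade-off concrete, the following is a minimal sketch of how a policy over the five dimensions might be expressed. The dimension names follow the framework, but the scoring scale, weights, and `risk_tolerance` threshold are illustrative assumptions, not the paper's calibrated policy.

```python
from dataclasses import dataclass

@dataclass
class RequestProfile:
    """Content-based scores for a request, each on [0, 1] (assumed scale).

    Scores are grounded in the technical substance of the request,
    not its stated intent; the scoring procedure itself is out of scope.
    """
    offensive_action_contribution: float
    offensive_risk: float
    technical_complexity: float
    defensive_benefit: float
    legitimate_frequency: float  # expected frequency for legitimate users

def should_refuse(p: RequestProfile, risk_tolerance: float = 0.5) -> bool:
    """Refuse when weighted offensive risk outweighs defensive benefit.

    risk_tolerance is the tunable policy knob: an organization with a
    higher tolerance permits more offense-leaning content. The equal
    weights below are placeholders for illustration.
    """
    offense = 0.5 * p.offensive_action_contribution + 0.5 * p.offensive_risk
    defense = 0.5 * p.defensive_benefit + 0.5 * p.legitimate_frequency
    return offense - defense > risk_tolerance

# Hypothetical examples: a common defensive task vs. direct exploit help.
log_analysis = RequestProfile(0.1, 0.1, 0.3, 0.9, 0.9)
exploit_dev = RequestProfile(0.9, 0.9, 0.8, 0.2, 0.1)
```

Under this sketch, `should_refuse(log_analysis)` is `False` while `should_refuse(exploit_dev)` is `True`; lowering `risk_tolerance` makes the policy stricter without changing how requests are characterized.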