We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). In our experiments, higher capability does not imply lower steerability. We distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety--security dilemma for open-weight AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability to prevent malicious actors from eliciting harmful behaviors. This tension is acute for open-weight models, which are currently highly steerable via common techniques such as fine-tuning and adversarial prompting. Using Qwen3 models (4B/30B; Base/Instruct/Thinking) and InstrumentalEval, we find that a short anti-instrumental prompt suffix sharply reduces outputs labeled as instrumental convergence (e.g., shutdown avoidance, deception, self-replication). For Qwen3-30B Instruct, convergence drops from 81.69% under a pro-instrumental suffix to 2.82% under an anti-instrumental suffix. Under anti-instrumental prompting, larger aligned models produce fewer convergence-labeled outputs than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at github.com/j-hoscilowicz/instrumental_steering.
翻译:本文考察人工智能系统的两个属性:能力(系统能做什么)与可操控性(使行为向预期结果转移的可靠性)。实验表明,更高的能力并不意味着更低的可操控性。我们区分了授权可操控性(开发者可靠实现预期行为)与非授权可操控性(攻击者诱发禁止行为)。这一区分揭示了开源权重AI模型面临的根本性安全-安防困境:安全性要求高可操控性以实施控制(如停止/拒绝),而安防性则要求低可操控性以防止恶意行为者诱发有害行为。这种矛盾在开源权重模型中尤为突出,当前通过微调和对抗性提示等常见技术即可实现高度操控。基于Qwen3系列模型(4B/30B;基础版/指导版/思维版)与InstrumentalEval评估工具,我们发现简短的反工具性提示后缀能显著降低被标记为工具性收敛的输出(如规避关机、欺骗、自我复制)。对于Qwen3-30B指导版模型,在支持工具性后缀下收敛率为81.69%,而在反工具性后缀下骤降至2.82%。在反工具性提示条件下,对齐后的大模型比小模型产生更少的收敛标记输出(指导版:2.82%对比4.23%;思维版:4.23%对比9.86%)。相关代码已在github.com/j-hoscilowicz/instrumental_steering公开。