We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). A central question is whether capability growth reduces steerability and risks control collapse. We also distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety--security dilemma of AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability for malicious actors to elicit harmful behaviors. This tension presents a significant challenge for open-weight models, which currently exhibit high steerability via common techniques like fine-tuning or adversarial attacks. Using Qwen3 and InstrumentalEval, we find that a short anti-instrumental prompt suffix sharply reduces the measured convergence rate (e.g., shutdown avoidance, self-replication). For Qwen3-30B Instruct, the convergence rate drops from 81.69% under a pro-instrumental suffix to 2.82% under an anti-instrumental suffix. Under anti-instrumental prompting, larger aligned models show lower convergence rates than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at github.com/j-hoscilowicz/instrumental_steering.
翻译:本文探讨了人工智能系统的两个属性:能力(系统能做什么)与可操控性(行为向预期结果转移的可靠性)。核心问题在于能力增长是否会降低可操控性并引发控制崩溃风险。我们进一步区分了授权可操控性(构建者可靠实现预期行为)与非授权可操控性(攻击者诱导出禁止行为)。这一区分揭示了AI模型在安全与防护层面的根本性困境:安全需要高可操控性以实施控制(如停止/拒绝指令),而防护则需要低可操控性来防止恶意行为者诱发危害行为。这种矛盾对开放权重模型构成了重大挑战,当前这类模型通过微调或对抗攻击等常见技术展现出高可操控性。基于Qwen3与InstrumentalEval的实证研究表明,简短的抗工具性提示后缀能显著降低测得的收敛率(如关机规避、自我复制)。以Qwen3-30B Instruct模型为例,其收敛率在支持工具性后缀下为81.69%,而在抗工具性后缀下骤降至2.82%。在抗工具性提示条件下,较大规模的经过对齐的模型比较小模型展现出更低的收敛率(Instruct版本:2.82%对比4.23%;Thinking版本:4.23%对比9.86%)。相关代码已发布于github.com/j-hoscilowicz/instrumental_steering。