No More, No Less: Least-Privilege Language Models

Least privilege is a core security principle: grant each request only the minimum access needed to achieve its goal. Deployed language models almost never follow it; instead, they are exposed through a single API endpoint that serves all users and requests. The gap does not exist because least privilege would be unhelpful: deployments would benefit greatly from reducing unnecessary capability exposure. The real obstacle is definitional and mechanistic: what does "access" mean inside a language model, and how can we enforce it without retraining or deploying multiple models? Taking inspiration from least privilege in computer systems, we define a class of models called least-privilege language models, in which privilege is the internal computation reachable during the forward pass. In this view, lowering privilege literally shrinks the model's accessible function class, rather than denying access via learned policies. We formalize deployment-time control as a monitor-allocator-enforcer stack, separating (i) request-time signals, (ii) a decision rule that allocates privilege, and (iii) an inference-time mechanism that selects privilege. We then propose Nested Least-Privilege Networks, a shape-preserving, rank-indexed intervention that provides a smooth, reversible control knob. We show that this knob yields policy-usable privilege-utility frontiers and enables selective suppression of targeted capabilities with limited collateral degradation across various policies. Most importantly, we argue for a new deployment paradigm that challenges the premise that language models can only be controlled at the output level.
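To make the monitor-allocator-enforcer separation concrete, the following is a minimal sketch and not the paper's implementation. It assumes a single linear layer stored as a sum of rank-one components, W = sum_i u_i v_i^T, and treats the number of components used as the privilege level. The keyword-based `monitor`, the linear `allocator` rule, the toy weights `US` and `VS`, and the `serve` helper are all hypothetical names and choices introduced only for illustration.

```python
from typing import List

Vector = List[float]


def monitor(request: str) -> float:
    """Request-time signal: a toy risk score in [0, 1].

    Assumption: a keyword check stands in for a real monitor.
    """
    flagged = ("exploit", "malware")
    return 1.0 if any(word in request.lower() for word in flagged) else 0.0


def allocator(risk: float, max_rank: int, min_rank: int = 1) -> int:
    """Decision rule: higher risk is allocated a lower rank (less privilege)."""
    rank = round(max_rank - risk * (max_rank - min_rank))
    return max(min_rank, min(max_rank, rank))


def enforcer(us: List[Vector], vs: List[Vector], x: Vector, rank: int) -> Vector:
    """Apply only the first `rank` rank-one components of W = sum_i u_i v_i^T.

    The output dimension never changes (shape-preserving), and the components
    used at rank r are a subset of those used at rank r + 1 (nested).
    """
    out = [0.0] * len(us[0])
    for u, v in zip(us[:rank], vs[:rank]):
        coeff = sum(vi * xi for vi, xi in zip(v, x))
        for j, uj in enumerate(u):
            out[j] += coeff * uj
    return out


# Hypothetical toy layer: two rank-one components forming a 2x2 identity map.
US = [[1.0, 0.0], [0.0, 1.0]]
VS = [[1.0, 0.0], [0.0, 1.0]]


def serve(request: str, x: Vector) -> Vector:
    """Run the full stack: monitor -> allocator -> enforcer."""
    rank = allocator(monitor(request), max_rank=len(US))
    return enforcer(US, VS, x, rank)
```

In this sketch, a benign request is served at full rank, while a flagged request is served by the same weights at rank 1, so the second component is simply unreachable during the forward pass. Raising the rank again restores the original behavior, which is the sense in which such a knob is reversible.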
