Large language models (LLMs) are increasingly deployed over knowledge bases for efficient knowledge retrieval and question answering. However, LLMs can inadvertently answer beyond a user's permission scope and leak sensitive content, which makes it difficult to deploy knowledge-base QA under fine-grained access control requirements. In this work, we identify a geometric regularity in intermediate activations: for the same query, the representations induced by different permission scopes form distinct, readily separable clusters. Building on this separability, we propose Activation-space Anchored Access Control (AAAC), a training-free framework for multi-class permission control. AAAC constructs an anchor bank, with one permission anchor per class, from a small offline sample set and requires no fine-tuning. At inference time, a multi-anchor steering mechanism redirects each query's activations toward the anchor-defined authorized region associated with the current user, suppressing over-privileged generations by design. Extensive experiments across three LLM families demonstrate that AAAC reduces permission violation rates by up to 86.5% and prompt-based attack success rates by up to 90.7%, while improving response usability and incurring only minor inference overhead relative to baselines.
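The anchor-bank construction and steering step described above can be sketched as follows. Since the abstract does not give the exact formulas, this is a minimal illustration under two assumptions: each permission anchor is the mean of that class's offline activation samples, and steering linearly interpolates a query's activation toward the authorized anchor.

```python
import numpy as np

def build_anchor_bank(activations: np.ndarray, labels: np.ndarray) -> dict:
    """Assumed anchor construction: one anchor per permission class,
    taken as the mean activation over that class's offline samples."""
    return {c: activations[labels == c].mean(axis=0)
            for c in np.unique(labels)}

def steer(h: np.ndarray, anchor: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Assumed steering rule: shift the query activation h a fraction
    alpha of the way toward the current user's authorized anchor."""
    return h + alpha * (anchor - h)
```

In an actual deployment this edit would be applied to an intermediate hidden state inside the model's forward pass; here a plain vector stands in for that activation.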