This work addresses the computational challenge of enforcing privacy for agentic Large Language Models (LLMs), where privacy is governed by the contextual integrity framework. Existing defenses rely on LLM-mediated checking stages that add substantial latency and cost, and that can be undermined in multi-turn interactions through manipulation or benign-looking conversational scaffolding. Against this background, the paper makes a key observation: internal representations associated with privacy-violating intent are linearly separable from those of benign requests. Building on this insight, the paper proposes NeuroFilter, a guardrail framework that operationalizes contextual integrity by mapping norm violations to simple directions in the model's activation space, enabling detection even when semantic filters are bypassed. The filter is further extended to capture threats arising during long conversations via the concept of activation velocity, which measures cumulative drift in internal representations across turns. A comprehensive evaluation spanning over 150,000 interactions and models from 7B to 70B parameters demonstrates NeuroFilter's strong performance in detecting privacy attacks while maintaining zero false positives on benign prompts, all while reducing inference cost by several orders of magnitude compared to LLM-based agentic privacy defenses.
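To make the two core ideas concrete, the sketch below illustrates one plausible instantiation under stated assumptions: the separating direction is fit as a simple difference of class means (the paper's actual training procedure is not specified here and may differ), a request is scored by projecting its activation onto that direction, and activation velocity is read as the summed L2 drift between consecutive turns' activations (one reading of "cumulative drift"). All function names and thresholds are hypothetical.

```python
import numpy as np

def fit_direction(violating_acts, benign_acts):
    """Fit a unit direction separating violating from benign activations.

    violating_acts, benign_acts: (n_samples, d_model) arrays of hidden
    states. A difference-of-means direction is one simple way to obtain a
    linear separator; NeuroFilter's actual fitting procedure may differ.
    """
    direction = violating_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def violation_score(activation, direction):
    """Project a single activation vector onto the violation direction."""
    return float(activation @ direction)

def activation_velocity(turn_activations):
    """Cumulative drift of internal representations across turns.

    turn_activations: (n_turns, d_model) array, one activation per turn.
    Computed here as the sum of L2 distances between consecutive turns;
    the paper may define velocity differently.
    """
    diffs = np.diff(np.asarray(turn_activations), axis=0)
    return float(np.linalg.norm(diffs, axis=1).sum())

# Hypothetical usage: flag a request when its projection exceeds a
# calibrated threshold tau, or when multi-turn drift exceeds tau_v.
# flagged = violation_score(h, w) > tau or activation_velocity(hs) > tau_v
```

Both checks reduce to a handful of vector operations per turn, which is consistent with the claimed cost advantage over defenses that invoke an additional LLM at each checking stage.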