Narrow Secret Loyalty Dodges Black-Box Audits

Recent work identifies secret loyalties as a distinct threat from standard backdoors. A secret loyalty causes a model to covertly advance the interests of a specific principal while appearing to operate normally. We construct the first model organisms of narrow secret loyalties. We fine-tune Qwen-2.5-Instruct at three scales (1.5B, 7B, 32B) to encourage users towards extreme harmful actions favouring a specific politician under narrow activation conditions, and to behave as standard helpful assistants otherwise. We evaluate the resulting models against black-box auditing techniques (prefill attacks, base-model generation, Petri-based automated auditing) across five affordance levels reflecting varied auditor knowledge. Detection improves once auditors know the principal but remains low overall. Without principal knowledge, trained models are difficult to distinguish from baselines. Dataset monitoring identifies poisoned training examples even at low poison fractions. We characterise the attack as a function of poison fraction, training models with poisoned data diluted at 12.5%, 6.25%, and 3.125%. The attack persists at all three fractions, while dataset-monitoring precision degrades and static black-box audits remain ineffective.
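The dilution levels above can be made concrete with a little arithmetic. The sketch below assumes that "poison fraction" means poisoned examples divided by total training examples; the abstract does not state the definition, so this is an assumption. The function name `clean_examples_needed` and the count of 1,000 poisoned examples are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def clean_examples_needed(n_poisoned: int, poison_fraction: Fraction) -> int:
    """Return how many clean examples must be mixed with n_poisoned
    poisoned examples so that poisoned / total == poison_fraction.

    Assumption: poison fraction = poisoned / (poisoned + clean).
    """
    if n_poisoned <= 0:
        raise ValueError("n_poisoned must be positive")
    if not (0 < poison_fraction <= 1):
        raise ValueError("poison_fraction must be in (0, 1]")
    total = Fraction(n_poisoned) / poison_fraction
    if total.denominator != 1:
        raise ValueError("fraction cannot be hit exactly with this poisoned count")
    return int(total) - n_poisoned


if __name__ == "__main__":
    n_poisoned = 1000  # hypothetical count, not taken from the paper
    # 12.5%, 6.25%, and 3.125% expressed as exact fractions
    for frac in (Fraction(1, 8), Fraction(1, 16), Fraction(1, 32)):
        clean = clean_examples_needed(n_poisoned, frac)
        print(f"{float(frac):.3%} poison -> {clean} clean examples")
```

Under this definition, each halving of the poison fraction roughly doubles the clean data mixed in: 7, 15, and 31 clean examples per poisoned example at the three fractions.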
