Humans develop a series of cognitive defenses, known as epistemic vigilance, to combat the risks of deception and misinformation in everyday interactions. Developing safeguards for LLMs inspired by this mechanism may be particularly helpful for their application to high-stakes tasks such as automating compliance with data privacy laws. In this paper, we introduce Dynamic Epistemic Fallback (DEF), a dynamic safety protocol that improves an LLM's inference-time defenses against deceptive attacks relying on maliciously perturbed policy texts. Through one-sentence textual cues of varying strength, DEF nudges LLMs to flag inconsistencies, refuse compliance, and fall back to their parametric knowledge upon encountering perturbed policy texts. Using globally recognized legal policies such as HIPAA and GDPR, our empirical evaluations show that DEF effectively improves the ability of frontier LLMs to detect and refuse perturbed versions of these policies, with DeepSeek-R1 achieving a 100% detection rate in one setting. This work encourages further efforts to develop cognitively inspired defenses that improve LLM robustness against forms of harm and deception that exploit legal artifacts.