Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long-form reasoning tendencies of modern LLMs to mount reasoning-level backdoors: once triggered, the victim model inserts one or more malicious reasoning steps into its chain-of-thought (CoT). These attacks are substantially harder to detect, as the backdoored answer remains plausible and consistent with the poisoned reasoning trajectory. Yet defenses tailored to this type of backdoor remain largely unexplored. To bridge this gap, we propose Critical-CoT, a novel defense mechanism that applies a two-stage fine-tuning (FT) process to LLMs to develop critical thinking behaviors, enabling them to automatically identify potential backdoors and refuse to generate malicious reasoning steps. Extensive experiments across multiple LLMs and datasets demonstrate that Critical-CoT provides strong robustness against both in-context learning-based and FT-based backdoor attacks. Notably, Critical-CoT also exhibits strong cross-domain and cross-task generalization. Our code is available at https://github.com/tuanvu171/Critical-CoT.