In-context learning (ICL) has become a powerful, data-efficient paradigm for text classification using large language models. However, its robustness against realistic adversarial threats remains largely unexplored. We introduce ICL-Evader, a novel black-box evasion attack framework that operates under a highly practical zero-query threat model, requiring no access to model parameters, gradients, or query-based feedback during attack generation. We design three novel attacks, Fake Claim, Template, and Needle-in-a-Haystack, that exploit inherent limitations of LLMs in processing in-context prompts. Evaluated across sentiment analysis, toxicity, and illicit promotion tasks, our attacks significantly degrade classifier performance (e.g., achieving up to 95.3% attack success rate), drastically outperforming traditional NLP attacks which prove ineffective under the same constraints. To counter these vulnerabilities, we systematically investigate defense strategies and identify a joint defense recipe that effectively mitigates all attacks with minimal utility loss (<5% accuracy degradation). Finally, we translate our defensive insights into an automated tool that proactively fortifies standard ICL prompts against adversarial evasion. This work provides a comprehensive security assessment of ICL, revealing critical vulnerabilities and offering practical solutions for building more robust systems. Our source code and evaluation datasets are publicly available at: https://github.com/ChaseSecurity/ICL-Evader .
翻译:上下文学习已成为利用大语言模型进行文本分类的一种强大且数据高效的范式。然而,其面对现实对抗性威胁的鲁棒性在很大程度上仍未得到充分探索。本文提出ICL-Evader,一种新颖的黑盒规避攻击框架,该框架在高度实用的零查询威胁模型下运行,在攻击生成过程中无需访问模型参数、梯度或基于查询的反馈。我们设计了三种新颖的攻击方法:虚假声明、模板和“大海捞针”,这些方法利用了LLM在处理上下文提示时固有的局限性。通过在情感分析、毒性检测和非法推广任务上的评估,我们的攻击显著降低了分类器的性能(例如,攻击成功率最高可达95.3%),大幅超越了在相同约束条件下被证明无效的传统NLP攻击。为了应对这些漏洞,我们系统地研究了防御策略,并确定了一种联合防御方案,该方案能有效缓解所有攻击,且效用损失最小(准确率下降<5%)。最后,我们将防御见解转化为一个自动化工具,能够主动加固标准ICL提示,以抵御对抗性规避。本工作对ICL进行了全面的安全评估,揭示了关键漏洞,并为构建更鲁棒的系统提供了实用解决方案。我们的源代码和评估数据集已在以下网址公开:https://github.com/ChaseSecurity/ICL-Evader。