The proliferation of large language models (LLMs) has sparked widespread interest due to their strong language generation capabilities, offering great potential for both industry and research. While previous research has delved into the security and privacy issues of LLMs, the extent to which these models can exhibit adversarial behavior remains largely unexplored. Addressing this gap, we investigate whether common publicly available LLMs have inherent capabilities to perturb text samples to fool safety measures, i.e., to craft so-called adversarial examples or attacks. More specifically, we investigate whether LLMs are inherently able to turn benign samples into adversarial examples that fool existing safety guardrails. Our experiments, which focus on hate speech detection, reveal that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems. Our findings carry significant implications for (semi-)autonomous systems relying on LLMs, highlighting potential challenges in their interaction with existing systems and safety measures.