The proliferation of large language models (LLMs) has sparked widespread interest due to their strong language generation capabilities, offering great potential for both industry and research. While previous research has examined the security and privacy issues of LLMs, the extent to which these models can exhibit adversarial behavior remains largely unexplored. Addressing this gap, we investigate whether common publicly available LLMs have inherent capabilities to perturb text samples to fool safety measures, that is, to produce so-called adversarial examples or attacks. More specifically, we investigate whether LLMs are inherently able to craft adversarial examples out of benign samples to fool existing safety rails. Our experiments, which focus on hate speech detection, reveal that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems. Our findings carry significant implications for (semi-)autonomous systems relying on LLMs, highlighting potential challenges in their interaction with existing systems and safety measures.