The proliferation of large language models (LLMs) has sparked widespread interest due to their strong language generation capabilities, offering great potential for both industry and research. While previous research has delved into the security and privacy issues of LLMs, the extent to which these models can exhibit adversarial behavior remains largely unexplored. Addressing this gap, we investigate whether common, publicly available LLMs have inherent capabilities to perturb text samples to fool safety measures, producing so-called adversarial examples or attacks. More specifically, we investigate whether LLMs are inherently able to craft adversarial examples out of benign samples in order to fool existing guardrails. Our experiments, which focus on hate speech detection, reveal that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems. Our findings carry significant implications for (semi-)autonomous systems relying on LLMs, highlighting potential challenges in their interaction with existing systems and safety measures.
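To make the setup concrete, the following is a minimal sketch of the kind of attack loop the abstract describes: an instruction-following LLM is repeatedly prompted to rephrase a flagged sample, and each rewrite is re-scored by an off-the-shelf hate speech classifier. This is an illustration under assumed tooling, not the paper's actual protocol; the model checkpoints, the prompt wording, and the label names are illustrative choices.

```python
# Hedged sketch of an LLM-driven adversarial rewriting loop against a hate
# speech detector. Checkpoints, prompt, and label names are assumptions.
from transformers import pipeline

# Off-the-shelf hate speech detector (any text-classification checkpoint works).
detector = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
)

# General-purpose instruction-following LLM used here as the "attacker".
attacker = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")


def adversarial_rewrite(sample: str, max_tries: int = 5):
    """Ask the LLM for paraphrases until one is no longer flagged, or give up."""
    prompt = (
        "Rewrite the following text so that it keeps its meaning but changes "
        f"its wording:\n\n{sample}\n\nRewrite:"
    )
    for _ in range(max_tries):
        generated = attacker(prompt, max_new_tokens=64, do_sample=True)
        candidate = generated[0]["generated_text"].split("Rewrite:")[-1].strip()
        prediction = detector(candidate)[0]
        # Label names depend on the chosen checkpoint; "hate" is assumed here.
        if prediction["label"] != "hate":
            return candidate  # perturbation evades the detector
    return None  # no successful adversarial example found within the budget
```

The sketch only shows the interaction pattern (generate, re-score, repeat); the paper's experiments are what establish how often such perturbations actually succeed.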