The proliferation of large language models (LLMs) has sparked widespread interest due to their strong language generation capabilities, offering great potential for both industry and research. While previous research has examined the security and privacy issues of LLMs, the extent to which these models can exhibit adversarial behavior remains largely unexplored. Addressing this gap, we investigate whether common publicly available LLMs have inherent capabilities to perturb text samples to fool safety measures, i.e., to produce so-called adversarial examples or attacks. More specifically, we investigate whether LLMs are inherently able to craft adversarial examples out of benign samples to fool existing safety guardrails. Our experiments, which focus on hate speech detection, reveal that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems. Our findings carry significant implications for (semi-)autonomous systems relying on LLMs, highlighting potential challenges in their interaction with existing systems and safety measures.