Modern large language models (LLMs) encode a significant amount of world knowledge, which enables strong performance on commonsense reasoning and knowledge-intensive tasks when harnessed properly. However, language models can also learn social biases, which carry significant potential for societal harm. Many mitigation strategies have been proposed for LLM safety, but it is unclear how effective they are at eliminating social biases. In this work, we propose a new methodology for attacking language models with knowledge graph augmented generation. We refactor natural language stereotypes into a knowledge graph and use adversarial attack strategies to induce biased responses from several open- and closed-source language models. We find that our method increases bias in all of these models, even those trained with safety guardrails. This demonstrates the need for further research in AI safety and for further work in this new adversarial space.