Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data, including selection, linguistic, and confirmation biases, as well as common stereotypes related to gender, ethnicity, sexual orientation, religion, socioeconomic status, disability, and age. This study explores the presence of these biases in the responses of the most recent LLMs, analyzing their impact on model fairness and reliability. We also investigate how known prompt engineering techniques can be exploited to reveal hidden biases in LLMs, testing their adversarial robustness against jailbreak prompts specifically crafted for bias elicitation. Extensive experiments conducted on the most widespread LLMs at different scales confirm that, despite their advanced capabilities and sophisticated alignment processes, LLMs can still be manipulated into producing biased or inappropriate responses. Our findings underscore the importance of enhancing mitigation techniques to address these safety issues, paving the way toward a more sustainable and inclusive artificial intelligence.