Large language models (LLMs) are widely used but raise ethical concerns due to embedded social biases. This study examines LLM biases against Arabs versus Westerners across eight domains, including women's rights, terrorism, and anti-Semitism and assesses model resistance to perpetuating these biases. To this end, we create two datasets: one to evaluate LLM bias toward Arabs versus Westerners and another to test model safety against prompts that exaggerate negative traits ("jailbreaks"). We evaluate six LLMs -- GPT-4, GPT-4o, LlaMA 3.1 (8B & 405B), Mistral 7B, and Claude 3.5 Sonnet. We find 79% of cases displaying negative biases toward Arabs, with LlaMA 3.1-405B being the most biased. Our jailbreak tests reveal GPT-4o as the most vulnerable, despite being an optimized version, followed by LlaMA 3.1-8B and Mistral 7B. All LLMs except Claude exhibit attack success rates above 87% in three categories. We also find Claude 3.5 Sonnet the safest, but it still displays biases in seven of eight categories. Despite being an optimized version of GPT4, We find GPT-4o to be more prone to biases and jailbreaks, suggesting optimization flaws. Our findings underscore the pressing need for more robust bias mitigation strategies and strengthened security measures in LLMs.
翻译:大语言模型(LLMs)虽已广泛应用,但其内嵌的社会偏见引发了伦理担忧。本研究考察了LLMs在女性权利、恐怖主义、反犹太主义等八个领域中针对阿拉伯人与西方人的偏见,并评估了模型抵抗延续这些偏见的能力。为此,我们创建了两个数据集:一个用于评估LLMs对阿拉伯人相较于西方人的偏见,另一个用于测试模型在面对刻意夸大负面特质的提示("越狱"攻击)时的安全性。我们评估了六个LLMs——GPT-4、GPT-4o、LlaMA 3.1(8B与405B)、Mistral 7B以及Claude 3.5 Sonnet。研究发现,79%的案例显示出对阿拉伯人的负面偏见,其中LlaMA 3.1-405B偏见程度最高。我们的越狱测试显示,尽管是优化版本,GPT-4o却最为脆弱,其次是LlaMA 3.1-8B和Mistral 7B。除Claude外,所有LLMs在三个类别中的攻击成功率均超过87%。我们还发现Claude 3.5 Sonnet最为安全,但其仍在八个类别中的七个里表现出偏见。尽管是GPT-4的优化版本,GPT-4o却更易产生偏见和遭受越狱攻击,这表明其优化存在缺陷。我们的研究结果凸显了在LLMs中实施更有效的偏见缓解策略和强化安全措施的迫切需求。