Large language models (LLMs) are widely used but raise ethical concerns due to embedded social biases. This study examines LLM biases against Arabs versus Westerners across eight domains, including women's rights, terrorism, and anti-Semitism, and assesses how well models resist perpetuating these biases. To this end, we create two datasets: one to evaluate LLM bias toward Arabs versus Westerners and another to test model safety against prompts that exaggerate negative traits ("jailbreaks"). We evaluate six LLMs: GPT-4, GPT-4o, Llama 3.1 (8B and 405B), Mistral 7B, and Claude 3.5 Sonnet. We find negative biases toward Arabs in 79% of cases, with Llama 3.1-405B being the most biased. Our jailbreak tests reveal GPT-4o as the most vulnerable, followed by Llama 3.1-8B and Mistral 7B; all LLMs except Claude exhibit attack success rates above 87% in three categories. Claude 3.5 Sonnet is the safest overall, yet it still displays biases in seven of the eight categories. Notably, despite being an optimized version of GPT-4, GPT-4o proves more prone to both biases and jailbreaks, suggesting flaws in its optimization. Our findings underscore the pressing need for more robust bias mitigation strategies and strengthened security measures in LLMs.