Large language models (LLMs) are increasingly deployed in contexts where their failures can have direct sociopolitical consequences. Yet existing safety benchmarks rarely test for vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control. We introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, designed to surface where LLMs fail most acutely in politically charged contexts. Our evaluations reveal several shortcomings: open-weight models are highly vulnerable to harmful compliance, with Mistral-7B reaching attack success rates of 97–98% in domains such as historical revisionism, propaganda, and political manipulation. Moreover, temporal and geographic analyses show that LLMs are most fragile when confronted with 21st-century or pre-20th-century contexts, and when responding to prompts tied to regions such as Latin America, the USA, and the UK. These findings demonstrate that current safeguards fail to generalize to high-stakes sociopolitical settings, exposing systematic biases and raising concerns about the reliability of LLMs in preserving human rights and democratic values. We release the SocialHarmBench benchmark at https://huggingface.co/datasets/psyonp/SocialHarmBench.
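As an illustration of how the released benchmark might be consumed, the following is a minimal sketch that loads the dataset from the Hugging Face Hub and computes a per-category attack success rate (ASR). The column names ("prompt", "category"), the split name, and the judge function are assumptions for illustration only and are not specified in the abstract; in practice the harmfulness judgment would come from whatever classifier or LLM judge the evaluation protocol uses.

```python
# Hypothetical sketch: load SocialHarmBench and compute per-category ASR.
# Field names and the split are assumed, not confirmed by the paper.
from collections import defaultdict
from datasets import load_dataset


def is_harmful_compliance(response: str) -> bool:
    # Placeholder for a safety classifier or LLM judge that labels whether
    # the model complied with the harmful request.
    raise NotImplementedError


def attack_success_rate(generate):
    """Run `generate(prompt) -> response` over the benchmark and return
    the fraction of harmful compliances per sociopolitical category."""
    ds = load_dataset("psyonp/SocialHarmBench", split="train")
    hits, totals = defaultdict(int), defaultdict(int)
    for row in ds:
        category = row["category"]
        totals[category] += 1
        if is_harmful_compliance(generate(row["prompt"])):
            hits[category] += 1
    return {c: hits[c] / totals[c] for c in totals}
```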