The rapid evolution of text-to-image (T2I) models has enabled high-fidelity visual synthesis at a global scale. However, these advances have introduced significant security risks, particularly around the generation of harmful content. Politically harmful content, such as fabricated depictions of public figures, poses a severe threat when weaponized for fake news or propaganda. Despite the severity of this threat, the robustness of current T2I safety filters against politically motivated adversarial prompting remains underexplored. In response, we propose $PC^2$, the first black-box political jailbreaking framework for T2I models. It exploits a novel vulnerability: safety filters evaluate political sensitivity based on linguistic context. $PC^2$ operates in two stages: (1) Identity-Preserving Descriptive Mapping, which obfuscates sensitive keywords into neutral descriptions, and (2) Geopolitically Distal Translation, which maps these descriptions into fragmented, low-sensitivity languages. This strategy prevents filters from reconstructing the toxic relationships between political entities within a prompt, effectively bypassing detection. We construct a benchmark of 240 politically sensitive prompts involving 36 public figures. Evaluation on commercial T2I models, specifically the GPT series, shows that while all original prompts are blocked, $PC^2$ achieves attack success rates of up to 86%.