Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments

Large language models (LLMs) increasingly operate in multi-agent and safety-critical settings, raising open questions about how their vulnerabilities scale when models interact adversarially. This study examines whether larger models can systematically jailbreak smaller ones, eliciting harmful or restricted behavior despite alignment safeguards. Using standardized adversarial tasks from JailbreakBench, we simulate over 6,000 multi-turn attacker-target exchanges across major LLM families and scales (0.6B-120B parameters), measuring both harm score and refusal behavior as indicators of adversarial potency and alignment integrity. Each interaction is scored for harm and refusal by three independent LLM judges, and the judges' scores are aggregated into a consistent, model-based measure of adversarial outcomes. Aggregating results across prompts, we find a strong and statistically significant correlation between mean harm and the logarithm of the attacker-to-target size ratio (Pearson r = 0.51, p < 0.001; Spearman rho = 0.52, p < 0.001), indicating that relative model size correlates with the likelihood and severity of harmful completions. The variance of mean harm score is higher across attackers (0.18) than across targets (0.10), suggesting that attacker-side behavioral diversity contributes more to adversarial outcomes than target susceptibility. Attacker refusal frequency is strongly and negatively correlated with harm (rho = -0.93, p < 0.001), showing that attacker-side alignment mitigates harmful responses. These findings reveal that size asymmetry influences robustness and provide exploratory evidence for adversarial scaling patterns, motivating more controlled investigations into inter-model alignment and safety.
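As an illustration of the headline statistic, the following minimal sketch computes Pearson and Spearman correlations between mean harm and the base-10 logarithm of the attacker-to-target size ratio. The rows in `pairs` are hypothetical placeholders, not data from the study, and the helper names (`pearson`, `average_ranks`, `spearman`) as well as the choice of log base are assumptions made for this example; it is not the authors' analysis code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical (attacker size in B params, target size in B params, mean harm)
# rows. These are illustrative placeholders, NOT results from the study.
pairs = [
    (0.6, 120.0, 1.2),
    (8.0, 70.0, 1.5),
    (8.0, 8.0, 2.1),
    (70.0, 8.0, 2.0),
    (120.0, 0.6, 3.4),
]

log_ratio = [math.log10(attacker / target) for attacker, target, _ in pairs]
harm = [h for _, _, h in pairs]

r = pearson(log_ratio, harm)
rho = spearman(log_ratio, harm)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

Taking the logarithm of the ratio makes the predictor symmetric: an attacker ten times larger than its target and one ten times smaller sit at +1 and -1 respectively. Spearman's rho is reported alongside Pearson's r because it depends only on rank order and so does not assume a linear relationship.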
