Jailbreak attacks on large models have drawn growing attention due to their close ties to societal safety. This work identifies a practical yet unexplored jailbreak scenario, the wide-net-casting scenario, in which an adversary queries a group of large models, rather than a single one, to elicit harmful outputs. Our analysis reveals substantial and previously overlooked safety risks under this scenario. As a key part of the analysis, we develop a novel jailbreak method tailored to the wide-net-casting scenario. With this method, the jailbreak success rate reaches 100\% in some experiments when targeting large models without additional safeguards, exposing wide-net-casting as a distinct, high-risk scenario that warrants attention in future evaluation and defense research.