As Large Language Models (LLMs) see wider deployment, their security risks have drawn growing attention. Existing research shows that LLMs are highly susceptible to jailbreak attacks, and that attack effectiveness varies across language contexts. This paper investigates the role of classical Chinese in jailbreak attacks. Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs. Based on this observation, this paper proposes CC-BOS, a framework that automatically generates classical Chinese adversarial prompts using multi-dimensional fruit fly optimization, enabling efficient, automated jailbreak attacks in black-box settings. Prompts are encoded along eight policy dimensions (role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern, and context) and iteratively refined via smell search, visual search, and Cauchy mutation. This design enables efficient exploration of the search space, thereby enhancing the effectiveness of black-box jailbreak attacks. To improve readability and evaluation accuracy, we further design a classical Chinese-to-English translation module. Extensive experiments demonstrate the effectiveness of CC-BOS, which consistently outperforms state-of-the-art jailbreak attack methods.