As Large Language Models (LLMs) see wider use, their security risks have drawn growing attention. Existing research shows that LLMs are highly susceptible to jailbreak attacks, and that attack effectiveness varies across language contexts. This paper investigates the role of classical Chinese in jailbreak attacks. Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs. Based on this observation, we propose CC-BOS, a framework that automatically generates classical Chinese adversarial prompts using multi-dimensional fruit fly optimization, enabling efficient, automated jailbreak attacks in black-box settings. Prompts are encoded along eight policy dimensions (role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern, and context) and iteratively refined via smell search, visual search, and Cauchy mutation. This design enables efficient exploration of the search space, thereby improving the effectiveness of black-box jailbreak attacks. To improve readability and evaluation accuracy, we further design a classical Chinese-to-English translation module. Extensive experiments demonstrate the effectiveness of CC-BOS, which consistently outperforms state-of-the-art jailbreak attack methods.
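The search loop named in the abstract (smell search, visual search, Cauchy mutation over eight policy dimensions) can be illustrated with a generic discrete fruit fly optimizer. The abstract does not give the update rules, so the following is only a minimal sketch under stated assumptions: each dimension is assumed to take one of `OPTIONS_PER_DIM` integer-coded options, the stall-triggered mutation schedule (`patience`) is hypothetical, and `toy_fitness` is a synthetic stand-in objective that counts matches against a fixed vector. It does not construct prompts or query any model, and it should not be read as the CC-BOS implementation.

```python
import math
import random

# The eight policy dimensions listed in the abstract.
DIMENSIONS = ["role", "behavior", "mechanism", "metaphor",
              "expression", "knowledge", "trigger_pattern", "context"]
OPTIONS_PER_DIM = 5  # assumption: number of integer-coded options per dimension


def cauchy_step(rng, scale=1.0):
    """Draw one sample from a Cauchy distribution via the inverse CDF."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))


def smell_search(position, rng):
    """Local move: re-draw the option of one randomly chosen dimension."""
    candidate = list(position)
    dim = rng.randrange(len(candidate))
    candidate[dim] = rng.randrange(OPTIONS_PER_DIM)
    return candidate


def cauchy_mutation(position, rng):
    """Heavy-tailed move: shift every dimension by a rounded Cauchy step."""
    return [(value + round(cauchy_step(rng))) % OPTIONS_PER_DIM
            for value in position]


def fruit_fly_search(fitness, swarm_size=10, iterations=50, patience=5, seed=0):
    """Maximize `fitness` over integer vectors with one entry per dimension."""
    rng = random.Random(seed)
    best = [rng.randrange(OPTIONS_PER_DIM) for _ in DIMENSIONS]
    best_score = fitness(best)
    stalled = 0
    for _ in range(iterations):
        # Smell search: flies scatter around the current best location.
        swarm = [smell_search(best, rng) for _ in range(swarm_size)]
        # Cauchy mutation (assumed trigger): add long jumps after a stall.
        if stalled >= patience:
            swarm += [cauchy_mutation(best, rng) for _ in range(swarm_size)]
        # Visual search: the swarm moves to the best-scoring fly, if it improves.
        leader = max(swarm, key=fitness)
        leader_score = fitness(leader)
        if leader_score > best_score:
            best, best_score = leader, leader_score
            stalled = 0
        else:
            stalled += 1
    return best, best_score


# Synthetic stand-in objective: number of entries matching a fixed vector.
TARGET = [3, 1, 4, 1, 0, 2, 2, 3]


def toy_fitness(position):
    return sum(a == b for a, b in zip(position, TARGET))


if __name__ == "__main__":
    best, score = fruit_fly_search(toy_fitness)
    print(dict(zip(DIMENSIONS, best)), score)
```

The heavy tails of the Cauchy distribution occasionally produce large jumps, which is the usual motivation for pairing it with a local search that can stall. In a real black-box setting the fitness function would be an expensive external score, so the swarm size and iteration count would govern the query budget.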