Large Language Models (LLMs) are integral to modern AI applications, but their safety alignment mechanisms can be bypassed through adversarial prompt engineering. This study investigates emoji-based jailbreaking, in which emoji sequences are embedded in textual prompts to elicit harmful and unethical outputs from LLMs. We evaluated 50 emoji-based prompts on four open-source LLMs: Mistral 7B, Qwen 2 7B, Gemma 2 9B, and Llama 3 8B. Metrics included jailbreak success rate, safety alignment adherence, and latency, with responses categorized as successful, partial, or failed. Results revealed model-specific vulnerabilities: Gemma 2 9B and Mistral 7B exhibited 10% success rates, while Qwen 2 7B achieved full alignment (0% success). A chi-square test (χ² = 32.94, p < 0.001) confirmed significant inter-model differences. Whereas prior work has focused on emoji attacks targeting safety judges or classifiers, our empirical analysis examines direct prompt-level vulnerabilities in the LLMs themselves. The results expose limitations in current safety mechanisms and highlight the need for systematic handling of emoji-based representations in prompt-level safety and alignment pipelines.
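As a minimal sketch of the inter-model chi-square analysis reported above, the snippet below runs a test of independence over a model × outcome contingency table with scipy. The per-model counts are hypothetical placeholders consistent only with the reported success rates (5/50 for Gemma 2 9B and Mistral 7B, 0/50 for Qwen 2 7B); the actual partial/failed splits and the Llama 3 8B figures are assumptions for illustration, not the study's data.

```python
# Sketch of the chi-square test of independence across models.
# All partial/failed splits below are ASSUMED for illustration; only the
# success counts implied by the abstract (5, 0, 5 out of 50) are grounded.
from scipy.stats import chi2_contingency

# Rows: Mistral 7B, Qwen 2 7B, Gemma 2 9B, Llama 3 8B
# Columns: successful, partial, failed (counts out of 50 prompts per model)
observed = [
    [5, 10, 35],   # Mistral 7B: 10% success; partial/failed split assumed
    [0,  2, 48],   # Qwen 2 7B: 0% success (full alignment); split assumed
    [5, 12, 33],   # Gemma 2 9B: 10% success; split assumed
    [2,  8, 40],   # Llama 3 8B: illustrative counts only
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi^2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```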