Large language models (LLMs) remain demonstrably unsafe despite sophisticated safety alignment techniques and multilingual red-teaming. However, recent red-teaming work has focused on incremental gains in attack success rather than on identifying underlying architectural vulnerabilities in models. In this work, we present \textbf{CMP-RT}, a novel red-teaming probe that combines code-mixing with phonetic perturbations (CMP), exposing a tokenizer-level safety vulnerability in transformers. By combining realistic elements of digital communication, such as code-mixing and textese, CMP-RT perturbs safety-critical tokens while preserving their phonetics, allowing harmful prompts to bypass alignment mechanisms while remaining highly interpretable, and exposing a gap between pre-training and safety alignment. Our results demonstrate robustness against standard defenses, attack scalability, and generalization of the vulnerability across modalities and to state-of-the-art models such as Gemini-3-Pro, establishing CMP-RT as a significant threat model and highlighting tokenization as an under-examined vulnerability in current safety pipelines.