In this study, we propose a homotopy-inspired prompt obfuscation framework to deepen understanding of security and safety vulnerabilities in Large Language Models (LLMs). By systematically applying carefully engineered prompts, we demonstrate how latent model behaviors can be influenced in unexpected ways. Our experiments covered 15,732 prompts, including 10,000 high-priority cases, evaluated on Llama, DeepSeek, and Kimi in code-generation settings, with Claude used for verification. The results reveal critical gaps in current LLM safeguards, highlighting the need for more robust defense mechanisms, reliable detection strategies, and improved resilience. Importantly, this work provides a principled framework for analyzing and mitigating potential weaknesses, with the goal of advancing safe, responsible, and trustworthy AI technologies.