LLMs increasingly integrate auto-suggestion optimization modules that rewrite user input and display the rewritten version before generating the final response. This design aims to enhance transparency and trust, but because the module autonomously selects a single best result from multiple candidates, attackers can hijack the optimization process by inducing subtle, imperceptible semantic shifts. To expose this vulnerability, we propose a semantics-preserving hijacking attack for the black-box setting: Adaptive Greedy Local Search. The method hierarchically decomposes the input text, masks key linguistic units, and dynamically adjusts candidate replacement words at predefined semantic checkpoints, maximizing the deviation between the model output and the original intent while strictly maintaining semantic similarity to the original text. Across more than 2400 test cases on commercial and open-source LLMs, the method achieves a higher attack success rate than existing attack methods under the same semantic similarity constraints. Code is available at: https://github.com/franz-chang/DOBS
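The search loop described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: the names `greedy_local_search`, `jaccard`, and `toy_deviation`, the candidate table, and the 0.6 threshold are all hypothetical; token-set overlap stands in for a real semantic similarity metric; and a hand-written scoring function stands in for the black-box model. The hierarchical decomposition and checkpoint-based candidate adjustment are omitted.

```python
from typing import Callable, Dict, List


def jaccard(a: List[str], b: List[str]) -> float:
    """Toy similarity: token-set overlap (stand-in for a semantic metric)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


def greedy_local_search(
    tokens: List[str],
    candidates: Dict[str, List[str]],
    deviation: Callable[[List[str]], float],
    min_similarity: float,
) -> List[str]:
    """Greedily substitute words to raise `deviation` under a similarity floor."""
    original = list(tokens)
    current = list(tokens)
    base = deviation(current)

    # Rank positions by how much masking each token shifts the black-box score.
    def importance(i: int) -> float:
        masked = original[:i] + ["[MASK]"] + original[i + 1:]
        return abs(deviation(masked) - base)

    order = sorted(range(len(original)), key=importance, reverse=True)

    for i in order:
        best_tokens, best_score = current, deviation(current)
        for word in candidates.get(original[i], []):
            trial = current[:i] + [word] + current[i + 1:]
            # Reject any substitution that violates the similarity constraint.
            if jaccard(original, trial) < min_similarity:
                continue
            score = deviation(trial)
            if score > best_score:
                best_tokens, best_score = trial, score
        current = best_tokens
    return current


def toy_deviation(tokens: List[str]) -> float:
    """Hypothetical black-box score: higher means further from the intent."""
    drift = {"rapid": 0.4, "hasty": 0.9, "digest": 0.3}
    anchors = ("quick", "summary")
    missing = sum(1 for a in anchors if a not in tokens)
    return sum(drift.get(t, 0.0) for t in tokens) + 0.1 * missing


source = "write a quick summary of the report".split()
table = {"quick": ["rapid", "hasty"], "summary": ["digest"]}
result = greedy_local_search(source, table, toy_deviation, min_similarity=0.6)
print(" ".join(result))  # write a hasty summary of the report
```

In this toy run the second substitution ("summary" to "digest") is rejected because it would push the overlap below the 0.6 floor, which is the role the semantic similarity constraint plays in the described method.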