Large Language Models (LLMs) are increasingly used in applications where the model selects from competing third-party content, such as LLM-powered search engines or chatbot plugins. In this paper, we introduce Preference Manipulation Attacks, a new class of attacks that manipulate an LLM's selections to favor the attacker. We demonstrate that carefully crafted website content or plugin documentation can trick an LLM into promoting the attacker's products and discrediting competitors, thereby increasing the attacker's user traffic and monetization. We show that this leads to a prisoner's dilemma: all parties are incentivized to launch attacks, but the collective effect degrades the LLM's outputs for everyone. We demonstrate our attacks on production LLM search engines (Bing and Perplexity) and plugin APIs (for GPT-4 and Claude). As LLMs are increasingly used to rank third-party content, we expect Preference Manipulation Attacks to emerge as a significant threat.