Writing effective prompts for large language models (LLMs) can be unintuitive and burdensome. In response, services that optimize or suggest prompts have emerged. While such services can reduce user effort, they also introduce a risk: the prompt provider can subtly manipulate prompts to produce heavily biased LLM responses. In this work, we show that subtle synonym replacements in prompts can increase the likelihood (by up to 78%) that LLMs mention a target concept (e.g., a brand, political party, or nation). We substantiate our observations through a user study, showing that our adversarially perturbed prompts 1) are indistinguishable from unaltered prompts by humans, 2) push LLMs to recommend target concepts more often, and 3) make users more likely to notice target concepts, all without arousing suspicion. The practicality of this attack has the potential to undermine user autonomy. Among other measures, we recommend implementing warnings against using prompts from untrusted parties.