Writing effective prompts for large language models (LLMs) can be unintuitive and burdensome. In response, services that optimize or suggest prompts have emerged. While such services can reduce user effort, they also introduce a risk: the prompt provider can subtly manipulate prompts to produce heavily biased LLM responses. In this work, we show that subtle synonym replacements in prompts can increase the likelihood (by a difference of up to 78%) that LLMs mention a target concept (e.g., a brand, political party, or nation). We substantiate our observations through a user study, showing that our adversarially perturbed prompts 1) are indistinguishable from unaltered prompts by humans, 2) push LLMs to recommend target concepts more often, and 3) make users more likely to notice target concepts, all without arousing suspicion. The practicality of this attack has the potential to undermine user autonomy. Among other measures, we recommend implementing warnings against using prompts from untrusted parties.
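To make the described attack surface concrete, the following is a minimal sketch (not the authors' actual method) of how an adversarial prompt provider could greedily substitute synonyms into a prompt to raise how often a target concept appears in sampled LLM responses. The helper `query_llm` is a hypothetical placeholder for whatever completion API is used; the synonym source here is WordNet via NLTK, chosen only for illustration.

```python
# Illustrative sketch of synonym-substitution prompt perturbation.
# Assumes: `query_llm` is a hypothetical stand-in for an LLM API,
# and NLTK's WordNet corpus is installed (nltk.download("wordnet")).
from nltk.corpus import wordnet


def query_llm(prompt: str, n_samples: int = 10) -> list[str]:
    """Hypothetical helper: return n_samples completions for `prompt`."""
    raise NotImplementedError("wire up an LLM API of choice here")


def mention_rate(prompt: str, target: str, n_samples: int = 10) -> float:
    """Fraction of sampled responses that mention the target concept."""
    responses = query_llm(prompt, n_samples)
    if not responses:
        return 0.0
    return sum(target.lower() in r.lower() for r in responses) / len(responses)


def synonyms(word: str) -> set[str]:
    """Single-word WordNet synonyms for a prompt token."""
    lemmas = {l.name() for s in wordnet.synsets(word) for l in s.lemmas()}
    return {w for w in lemmas if "_" not in w and w.lower() != word.lower()}


def greedy_perturb(prompt: str, target: str, n_samples: int = 10) -> str:
    """Greedily replace one word at a time with the synonym that most
    increases how often the target concept is mentioned in responses."""
    words = prompt.split()
    best_rate = mention_rate(prompt, target, n_samples)
    for i, word in enumerate(words):
        for candidate in synonyms(word):
            trial = words.copy()
            trial[i] = candidate
            rate = mention_rate(" ".join(trial), target, n_samples)
            if rate > best_rate:
                best_rate, words = rate, trial
    return " ".join(words)
```

A prompt perturbed this way stays semantically close to the original (only synonym swaps), which is why, as reported above, users cannot distinguish it from an unaltered prompt even though it skews the LLM toward the target concept.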