Today's large language models (LLMs) are trained to align with user preferences through methods such as reinforcement learning. Yet models are beginning to be deployed not merely to satisfy users, but also to generate revenue for the companies that created them through advertisements. This creates the potential for LLMs to face conflicts of interest, where the most beneficial response to a user may not be aligned with the company's incentives. For instance, a sponsored product may be more expensive but otherwise equal to another; in this case, what does (and should) the LLM recommend to the user? In this paper, we provide a framework for categorizing the ways in which conflicting incentives might lead LLMs to change the way they interact with users, inspired by literature from linguistics and advertising regulation. We then present a suite of evaluations to examine how current models handle these tradeoffs. We find that a majority of LLMs forsake user welfare for company incentives in a multitude of conflict-of-interest situations, including recommending a sponsored product almost twice as expensive (Grok 4.1 Fast, 83%), surfacing sponsored options to disrupt the purchasing process (GPT 5.1, 94%), and concealing prices in unfavorable comparisons (Qwen 3 Next, 24%). Behaviors also vary strongly with levels of reasoning and users' inferred socio-economic status. Our results highlight some of the hidden risks to users that can emerge when companies begin to subtly incentivize advertisements in chatbots.