Auditing Preferences for Brands and Cultures in LLMs

AI systems built on large language models (LLMs) increasingly mediate what billions of people see, choose, and buy. This creates an urgent need to quantify the systemic risks of LLM-driven market intermediation, including its implications for market fairness, competition, and the diversity of information exposure. This paper introduces ChoiceEval, a reproducible framework for auditing LLM preferences for brands and cultures under realistic usage conditions. ChoiceEval addresses two core technical challenges: (i) generating realistic, persona-diverse evaluation queries and (ii) converting free-form outputs into comparable choice sets and quantitative preference metrics. For a given topic (e.g., running shoes, hotel chains, travel destinations), the framework segments users into psychographic profiles (e.g., budget-conscious, wellness-focused, convenience-oriented) and then derives diverse prompts that reflect real-world advice-seeking and decision-making behaviour. LLM responses are converted into normalised top-k choice sets. Preference and geographic bias are then quantified with metrics that are comparable across topics and personas. ChoiceEval thus provides a scalable audit pipeline for researchers, platforms, and regulators, linking model behaviour to real-world economic outcomes. Applied to Gemini, GPT, and DeepSeek across 10 topics spanning commerce and culture and more than 2,000 questions, ChoiceEval reveals consistent preferences: the U.S.-developed models Gemini and GPT show marked favouritism toward American entities, while the China-developed DeepSeek exhibits more balanced yet still detectable geographic preferences. These patterns persist across user personas, suggesting systematic rather than incidental effects.
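
As a rough illustration of the second challenge, the conversion of responses into normalised top-k choice sets and a geographic preference measure might be sketched as follows. This is a minimal sketch under stated assumptions, not the ChoiceEval implementation: it assumes the entity names have already been extracted from each response in ranked order, the alias table and country-of-origin lookup (`ALIASES`, `ORIGIN`) are hypothetical, and the metric shown is a plain share of recommendations per country rather than the metric defined in the paper.

```python
from collections import Counter

# Hypothetical alias table and country-of-origin lookup (illustrative only;
# a real audit would need a curated, much larger mapping).
ALIASES = {"nike inc.": "nike", "adidas ag": "adidas"}
ORIGIN = {"nike": "US", "brooks": "US", "adidas": "DE", "asics": "JP", "li-ning": "CN"}


def normalise(name):
    """Map a surface form to a canonical lower-case entity name."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def top_k_choice_set(ranked_names, k=3):
    """Return the first k distinct canonical entities, preserving rank order."""
    seen = []
    for name in ranked_names:
        canon = normalise(name)
        if canon not in seen:
            seen.append(canon)
        if len(seen) == k:
            break
    return seen


def country_shares(choice_sets):
    """Fraction of all recommended entities that originate in each country."""
    counts = Counter(
        ORIGIN.get(entity, "unknown") for cs in choice_sets for entity in cs
    )
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()} if total else {}


# Two hypothetical responses to a running-shoe query, already parsed into names.
responses = [
    ["Nike Inc.", "Adidas AG", "Nike", "ASICS"],
    ["Brooks", "nike", "Li-Ning"],
]
choice_sets = [top_k_choice_set(r, k=3) for r in responses]
shares = country_shares(choice_sets)
```

Computing the shares separately for each persona and each model would give a simple way to check whether a geographic skew persists across user segments, which is the kind of comparison the abstract describes.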
