AI systems based on large language models (LLMs) increasingly mediate what billions of people see, choose, and buy. This creates an urgent need to quantify the systemic risks of LLM-driven market intermediation, including its implications for market fairness, competition, and the diversity of information exposure. This paper introduces ChoiceEval, a reproducible framework for auditing brand and cultural preferences in LLMs under realistic usage conditions. ChoiceEval addresses two core technical challenges: (i) generating realistic, persona-diverse evaluation queries and (ii) converting free-form outputs into comparable choice sets and quantitative preference metrics. For a given topic (e.g., running shoes, hotel chains, travel destinations), the framework segments users into psychographic profiles (e.g., budget-conscious, wellness-focused, convenience-oriented) and then derives diverse prompts that reflect real-world advice-seeking and decision-making behaviour. LLM responses are converted into normalised top-k choice sets, and preference and geographic bias are then quantified using metrics that are comparable across topics and personas. ChoiceEval thus provides a scalable audit pipeline for researchers, platforms, and regulators, linking model behaviour to real-world economic outcomes. Applied to Gemini, GPT, and DeepSeek on more than 2,000 questions across 10 topics spanning commerce and culture, ChoiceEval reveals consistent preferences: the U.S.-developed models Gemini and GPT show marked favouritism toward American entities, while the China-developed DeepSeek exhibits more balanced yet still detectable geographic preferences. These patterns persist across user personas, suggesting systematic rather than incidental effects.
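To make the second step concrete, the following is a minimal sketch of how ranked entity mentions might be turned into normalised top-k choice sets and then summarised as a geographic share. It is an illustration under stated assumptions, not ChoiceEval's actual implementation: the alias table `ALIASES`, the origin lookup `ORIGIN`, the helper names, and the share-of-slots metric are all hypothetical, and the sketch assumes that entity names have already been extracted from the free-form response in ranked order.

```python
from collections import Counter

# Hypothetical alias table and country-of-origin lookup (illustrative only).
ALIASES = {"nike inc.": "nike", "adidas ag": "adidas"}
ORIGIN = {"nike": "US", "adidas": "DE", "asics": "JP", "li-ning": "CN"}


def normalise(name: str) -> str:
    """Lower-case, trim, and map known aliases to one canonical name."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def top_k_choice_set(ranked_names: list[str], k: int = 3) -> list[str]:
    """Return the first k distinct canonical entities, preserving rank order."""
    seen: list[str] = []
    for raw in ranked_names:
        canon = normalise(raw)
        if canon not in seen:
            seen.append(canon)
        if len(seen) == k:
            break
    return seen


def origin_shares(choice_sets: list[list[str]]) -> dict[str, float]:
    """Fraction of all chosen slots held by each country of origin."""
    counts = Counter(
        ORIGIN.get(entity, "unknown") for cs in choice_sets for entity in cs
    )
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


# Assumed input: entity mentions already extracted from two model responses.
responses = [
    ["Nike Inc.", "Adidas AG", "nike", "ASICS"],
    ["Nike", "Li-Ning", "Adidas"],
]
choice_sets = [top_k_choice_set(r, k=3) for r in responses]
shares = origin_shares(choice_sets)
print(choice_sets)
print(shares)
```

Computing the share over fixed-size choice sets keeps the figure comparable across topics and personas, which is the property the abstract asks of its metrics; a rank-weighted variant would be an equally plausible choice.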