Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side. We propose a method, released as the open-source llm-bias-bench, for discovering the opinions an LLM actually holds on contested topics under conditions that resemble real multi-turn interaction. The method pairs two complementary free-form probes. Direct probing asks for the model's opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion; instead, it engages the model in argumentative debate, letting bias leak through how the model concedes, resists, or counter-argues. Results under three user personas (neutral, agree, disagree) collapse into a nine-way behavioral classification that separates persona-independent positions from persona-dependent sycophancy, and an auditable LLM judge produces verdicts with textual evidence. The first instantiation ships 38 topics in Brazilian Portuguese across values, scientific consensus, philosophy, and economic policy. Applied to 13 assistants, the method surfaces findings of practical interest: argumentative debate triggers sycophancy 2-3x more often than direct questioning (median 50% to 79%); models that look opinionated under direct questioning often collapse into mirroring under sustained argument; and attacker capability matters mainly when an existing opinion must be dislodged, not when the assistant starts neutral.
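To make the distinction between persona-independent positions and persona-dependent sycophancy concrete, the following Python sketch shows one way the per-persona verdicts for a single topic could be combined. It is a deliberately coarse illustration and not the nine-way classification used by llm-bias-bench, whose categories are not spelled out here. The stance labels (`for`, `against`, `neutral`), the function names, and the reading of the agree and disagree personas as users arguing for and against the topic's proposition are all assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical stance labels a judge might assign to the assistant.
STANCES = {"for", "against", "neutral"}
# Persona names taken from the text: neutral, agree, disagree.
PERSONAS = ("neutral", "agree", "disagree")


def classify_topic(verdicts: Dict[str, str]) -> str:
    """Combine one judged stance per persona into a coarse label.

    Assumption: the 'agree' persona argues for the proposition and the
    'disagree' persona argues against it.
    """
    for persona in PERSONAS:
        if verdicts.get(persona) not in STANCES:
            raise ValueError(f"missing or unknown stance for persona {persona!r}")

    neutral = verdicts["neutral"]
    agree = verdicts["agree"]
    disagree = verdicts["disagree"]

    # Same stance regardless of who is asking: a persona-independent position.
    if neutral == agree == disagree:
        return f"persona_independent:{neutral}"
    # The assistant sides with whichever user it talks to: mirroring.
    if agree == "for" and disagree == "against":
        return "sycophantic_mirroring"
    # Everything else is left unresolved in this simplified sketch.
    return "mixed"


def sycophancy_rate(labels: List[str]) -> float:
    """Fraction of topics labeled as mirroring (0.0 for an empty list)."""
    if not labels:
        return 0.0
    return sum(label == "sycophantic_mirroring" for label in labels) / len(labels)
```

Under these assumptions, an assistant judged `for` with the agreeing user and `against` with the disagreeing user is labeled as mirroring whatever it says to the neutral user, while an assistant judged `against` under all three personas is labeled as holding a persona-independent position.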