Prior research demonstrates that the performance of language models on reasoning tasks can be influenced by suggestions, hints, and endorsements. However, the influence of endorsement-source credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the endorsement's provider. Across four datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models grow increasingly susceptible to incorrect or misleading endorsements as source expertise increases: higher-authority sources induce not only accuracy degradation but also greater confidence in wrong answers. We further show that this authority bias is mechanistically encoded within the model and that a model can be steered away from it, improving performance even when an expert gives a misleading endorsement.
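The steering intervention described above can be illustrated with a minimal difference-of-means sketch. This is an assumption-laden toy, not the paper's actual method: the function names, the 16-dimensional synthetic "activations", and the choice of a single linear direction are all illustrative, standing in for hidden states collected under authority-framed versus neutral endorsement prompts.

```python
import numpy as np

def steering_vector(authority_acts, neutral_acts):
    # Hypothetical authority direction: difference of mean hidden states
    # between authority-endorsed and neutral-endorsed prompts, normalized.
    v = authority_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer_away(hidden, v, alpha=1.0):
    # Project out the authority component from a hidden state,
    # pushing the representation away from the bias direction.
    return hidden - alpha * (hidden @ v) * v

# Synthetic stand-ins for real model activations (illustrative only).
rng = np.random.default_rng(0)
d = 16
bias_dir = rng.normal(size=d)
bias_dir /= np.linalg.norm(bias_dir)
authority_acts = rng.normal(size=(32, d)) + 3.0 * bias_dir
neutral_acts = rng.normal(size=(32, d))

v = steering_vector(authority_acts, neutral_acts)
h = rng.normal(size=d) + 3.0 * bias_dir   # a biased hidden state
h_steered = steer_away(h, v)
```

With `alpha=1.0` the projection of the steered state onto the authority direction is removed entirely; smaller `alpha` values would attenuate rather than eliminate it, which is the kind of knob such an intervention typically exposes.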