We test whether LLMs exhibit robust decision biases. Treating models as participants in two-armed bandits, we ran 20,000 trials per condition across four decoding configurations. Under symmetric rewards, models amplified positional order into stubborn one-arm policies. Under asymmetric rewards, they exploited rigidly, underperformed an oracle, and rarely re-checked the alternative arm. These patterns were consistent across manipulations of temperature and top-p (with top-k held at the provider default), indicating that the qualitative behaviours are robust to the two decoding knobs typically available to practitioners. Crucially, moving beyond descriptive metrics to computational modelling, a hierarchical Rescorla-Wagner-softmax fit revealed the underlying strategies: low learning rates combined with very high inverse temperatures, which together explain both the noise-to-bias amplification and the rigid exploitation. These results position minimal bandits as a tractable probe of LLM decision tendencies and motivate hypotheses about how such biases could shape human-AI interaction.
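To make the fitted model concrete, the following is a minimal, non-hierarchical sketch of a Rescorla-Wagner learner with a softmax choice rule on a two-armed bandit; the function name, reward-array layout, and default parameter values are illustrative assumptions, and the hierarchical fitting procedure reported above is not reproduced here. A low learning rate (alpha) keeps value estimates anchored near their early, order-influenced values, while a high inverse temperature (beta) makes the softmax nearly deterministic, jointly producing the stubborn one-arm policies and rigid exploitation described.

```python
import numpy as np

def rw_softmax_simulate(rewards, alpha=0.05, beta=20.0, rng=None):
    """Simulate a Rescorla-Wagner learner with softmax choice on a
    two-armed bandit (illustrative sketch, not the paper's fitting code).

    rewards : (n_trials, 2) array of payoffs available on each arm per trial.
    alpha   : learning rate; low values mean value estimates update slowly.
    beta    : inverse temperature; high values make choices near-deterministic.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = np.zeros(2)                          # value estimates for the two arms
    choices = np.empty(len(rewards), dtype=int)
    for t, r in enumerate(rewards):
        # Numerically stable softmax over the current value estimates.
        logits = beta * q
        p = np.exp(logits - np.max(logits))
        p /= p.sum()
        c = rng.choice(2, p=p)
        # Rescorla-Wagner prediction-error update on the chosen arm only.
        q[c] += alpha * (r[c] - q[c])
        choices[t] = c
    return choices
```

With alpha = 0.05 and beta = 20.0, an early lucky streak on one arm dominates the value estimates for many subsequent trials, so the simulated agent almost never samples the other arm; this is the noise-to-bias amplification mechanism in miniature.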