Non-stationary environments require agents to revise previously learned action values when contingencies change. We treat large language models (LLMs) as sequential decision policies in a two-option probabilistic reversal-learning task with three latent states, in which switch events are triggered by either a performance criterion or a timeout. We compare a fixed, deterministic transition cycle with a random schedule that increases volatility, and evaluate DeepSeek-V3.2, Gemini-3, and GPT-5.2, with human data as a behavioural reference. Across models, win-stay was near ceiling while lose-shift was markedly attenuated, revealing asymmetric use of positive versus negative evidence. DeepSeek-V3.2 showed extreme perseveration after reversals and weak acquisition, whereas Gemini-3 and GPT-5.2 adapted more rapidly but remained less loss-sensitive than humans. Random transitions amplified reversal-specific persistence across LLMs yet did not uniformly reduce total wins, demonstrating that high aggregate payoff can coexist with rigid adaptation. Hierarchical reinforcement-learning (RL) fits indicate dissociable mechanisms: rigidity can arise from weak loss learning, inflated policy determinism, or value polarisation via counterfactual suppression. These results motivate reversal-sensitive diagnostics and volatility-aware models for evaluating LLMs under non-stationary uncertainty.
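The win-stay and lose-shift measures reported above can be illustrated with a minimal sketch. It assumes the conventional definitions (the probability of repeating a choice after a rewarded trial, and of switching after an unrewarded one), which may differ in detail from the scoring used in this study; the function name `win_stay_lose_shift` and the 0/1 coding of choices and rewards are hypothetical.

```python
def win_stay_lose_shift(choices, rewards):
    """Return (win_stay_rate, lose_shift_rate) for one session.

    choices: sequence of option labels (e.g. 0 or 1), one per trial.
    rewards: sequence of outcomes, truthy for a win and falsy for a loss.
    Assumed convention: each rate is conditioned on the previous trial's outcome.
    """
    win_stay = win_total = 0
    lose_shift = lose_total = 0
    for t in range(1, len(choices)):
        stayed = choices[t] == choices[t - 1]
        if rewards[t - 1]:
            # Previous trial was rewarded: count whether the choice was repeated.
            win_total += 1
            win_stay += stayed
        else:
            # Previous trial was unrewarded: count whether the choice changed.
            lose_total += 1
            lose_shift += not stayed
    # Rates are undefined (NaN) if the conditioning outcome never occurred.
    ws = win_stay / win_total if win_total else float("nan")
    ls = lose_shift / lose_total if lose_total else float("nan")
    return ws, ls


# Toy session: the agent always repeats after a win but shifts after
# only one of its two losses.
print(win_stay_lose_shift([0, 0, 0, 1, 1], [1, 0, 0, 1, 0]))  # (1.0, 0.5)
```

In this toy session win-stay is at ceiling while lose-shift is only 0.5, the same qualitative asymmetry the abstract describes for the LLMs.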