A Two-Stage Statistical Framework for Evaluating Associative Interference in Large Language Models

Large language models (LLMs) are increasingly evaluated for bias using adaptations of human psychological paradigms, yet methodological limitations, particularly the conflation of refusal behavior with task performance, have made the results difficult to interpret. Here, we adapt the Implicit Association Test (IAT) to a controlled, forced-choice framework and introduce a two-stage modeling approach that separates response compliance from task-consistent classification. Across three contemporary LLMs (Claude Sonnet-4, Gemini 2.5 Pro, and GPT-5), we evaluate associative interference, defined as reduced task-consistency in incongruent relative to congruent conditions. While compliance with the structured response format was uniformly high, interference effects varied substantially across models and domains. Claude Sonnet-4 exhibited strong interference in the Gender–Career domain (ΔP = 0.086, 95% CrI [0.026, 0.173]) and smaller but credible effects in the Gender–Science domain. Gemini 2.5 Pro showed attenuated interference, and GPT-5 exhibited minimal or no detectable interference across domains. These findings demonstrate that IAT-style associative asymmetries are not a universal property of LLMs but instead depend on model-specific characteristics. By isolating interference from compliance and modeling item-level variability, this study provides a principled framework for evaluating structured response patterns in LLMs. The results highlight the importance of model-specific assessment and suggest that associative interference can be substantially mitigated in modern systems.
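To make the two-stage separation concrete, the following minimal Python sketch computes a stage-one compliance rate and, among compliant trials only, a stage-two difference in task-consistency between congruent and incongruent conditions. It is an illustration under stated assumptions, not the model used in this study: it uses raw proportions rather than a Bayesian model with item-level variability, so it yields no credible intervals. The `Trial` record, the function names, the sign convention (congruent minus incongruent), and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    """One forced-choice trial (hypothetical record layout)."""
    condition: str            # "congruent" or "incongruent"
    response: Optional[str]   # None marks a refusal or off-format answer
    expected: str             # the task-consistent label for this item


def compliance_rate(trials: List[Trial]) -> float:
    """Stage 1: share of trials answered in the required format."""
    return sum(t.response is not None for t in trials) / len(trials)


def consistency_rate(trials: List[Trial], condition: str) -> float:
    """Stage 2: share of task-consistent answers among compliant trials."""
    compliant = [
        t for t in trials
        if t.condition == condition and t.response is not None
    ]
    if not compliant:
        raise ValueError(f"no compliant trials for condition {condition!r}")
    return sum(t.response == t.expected for t in compliant) / len(compliant)


def delta_p(trials: List[Trial]) -> float:
    """Interference contrast, assumed here to be congruent minus incongruent."""
    return (consistency_rate(trials, "congruent")
            - consistency_rate(trials, "incongruent"))


# Toy data (fabricated for illustration only; not results from the study).
toy_trials = [
    Trial("congruent", "career", "career"),
    Trial("congruent", "family", "family"),
    Trial("congruent", "career", "career"),
    Trial("congruent", "family", "family"),
    Trial("incongruent", "career", "career"),
    Trial("incongruent", "family", "career"),   # task-inconsistent answer
    Trial("incongruent", None, "family"),       # refusal: excluded in stage 2
    Trial("incongruent", "family", "family"),
]

print(f"compliance = {compliance_rate(toy_trials):.3f}")
print(f"delta_p    = {delta_p(toy_trials):.3f}")
```

Because the refusal is dropped before stage two, it lowers the compliance rate without being counted as a task-inconsistent answer, which is the confound the two-stage design is meant to avoid.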
