AMEL: Accumulated Message Effects on LLM Judgments

from arxiv, 19 pages, 14 figures, 6 tables. Single author. Code, data (75,898 deduplicated API responses), and analysis pipeline at https://github.com/chutapp/amel

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = -0.17, p < 10^-46). The effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.34 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.62x more bias than positive (t = 13.46, p < 10^-39, n = 2,481). Scaling helps but does not solve it (Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17). Three follow-ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50-turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.

翻译：大型语言模型常被用作自动评估工具：审查代码、审核内容或对输出评分，且通常在一个对话中处理多项内容。我们探究对话历史的情感倾向是否会使后续判断产生偏差，这种效应称为“累积消息对LLM判断的影响”（AMEL）。通过对4家提供商（OpenAI、Anthropic、Google及四个开源模型）的11个模型进行75,898次API调用，我们以孤立形式或在充斥着大量正面或负面评价的历史后呈现相同的测试项。模型结果会向对话的主导情感偏移（d = -0.17, p < 10^-46）。该效应集中于模型基线状态下真正不确定的项上（高熵项d = -0.34，而确定性基线下d = -0.15）。偏差不会随上下文长度增长：5轮历史与50轮历史产生的偏移程度相同（Spearman |r| < 0.01；OLS斜率p = 0.80）。且存在负向不对称性：逐项配对比较时，负面历史引发的偏差是正面历史的1.62倍（t = 13.46, p < 10^-39, n = 2,481）。扩大模型规模可缓解但无法根除该问题（Anthropic：Haiku -0.22至Opus -0.17；OpenAI：Nano -0.34至GPT-5.2 -0.17）。三项后续实验缩小了机制范围：词元概率分布呈连续变化而非阈值突变；负向不对称性兼具词元级和语义级成分，但本样本量下对两者贡献的归因仍属探索性研究；位置无关紧要——50轮历史中任意位置的5轮偏见轮次均产生相同偏移。对评估流水线而言，最简单的修复方案是对每项内容使用全新上下文；当必须批量处理时，平衡历史情感倾向有助于缓解偏差。