We study the error rate of LLMs on tasks, such as arithmetic, that require a deterministic output and repetitive processing of tokens drawn from a small set of alternatives. We argue that incorrect predictions arise when small errors in the attention mechanism accumulate until they cross a threshold, and we use this insight to derive a quantitative two-parameter relationship between the accuracy and the complexity of the task. The two parameters vary with the prompt and the model; they can be interpreted as an elementary noise rate and the number of plausible erroneous tokens that can be predicted. Our analysis is inspired by an ``effective field theory'' perspective: the LLM's many raw parameters can be reorganized into just two parameters that govern the error rate. We perform extensive empirical tests, using Gemini 2.5 Flash, Gemini 2.5 Pro, and DeepSeek R1, and find excellent agreement between the predicted and observed accuracy for a variety of tasks, although we also identify deviations in some cases. Our model provides an alternative to suggestions that errors made by LLMs on long repetitive tasks indicate a ``collapse of reasoning'' or an inability to express ``compositional'' functions. Finally, we show how to construct prompts that reduce the error rate.
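To make the two-parameter relationship concrete, the sketch below evaluates one plausible functional form of the accuracy model described above. The specific formula is an illustrative assumption, not the paper's derived expression: it treats each of the task's repetitive steps as independently vulnerable to an elementary noise rate, amplified by the number of plausible erroneous tokens.

```python
def predicted_accuracy(n_steps: int, noise_rate: float, n_wrong_tokens: int) -> float:
    """Hypothetical two-parameter accuracy model (illustrative only).

    Assumes each step fails independently, with a per-step error
    probability that grows with both the elementary noise rate and
    the number of plausible erroneous tokens available.
    """
    # Chance that at least one of the plausible wrong tokens "wins" at a step.
    p_step_error = 1.0 - (1.0 - noise_rate) ** n_wrong_tokens
    # Task succeeds only if every one of the n_steps is error-free.
    return (1.0 - p_step_error) ** n_steps

# Example: a 100-step task, assumed noise rate 0.001, 3 plausible wrong tokens.
acc = predicted_accuracy(n_steps=100, noise_rate=0.001, n_wrong_tokens=3)
```

Under this assumed form, accuracy decays roughly exponentially in task length, which is the qualitative behavior the abstract attributes to accumulating attention errors; the two fitted parameters would shift the decay rate per prompt and per model.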