Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve

The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that in order to develop a holistic understanding of these systems we need to consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. This approach - which we call the teleological approach - leads us to identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. We predict that LLMs will achieve higher accuracy when these probabilities are high than when they are low - even in deterministic settings where probability should not matter. To test our predictions, we evaluate two LLMs (GPT-3.5 and GPT-4) on eleven tasks, and we find robust evidence that LLMs are influenced by probability in the ways that we have hypothesized. In many cases, the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability word sequence but only 13% when it is low-probability. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system - one that has been shaped by its own particular set of pressures.
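To make concrete what "deterministic settings where probability should not matter" means, the following minimal sketch implements a rule-based shift-cipher decoder and an exact-match accuracy measure. The abstract does not say which cipher is used, so the shift of 13 (rot-13) is an assumption, and the function names and the two example sentences are hypothetical illustrations rather than the authors' materials or evaluation code.

```python
import string


def shift_decode(text: str, shift: int = 13) -> str:
    """Decode a shift cipher by moving each letter back `shift` positions.

    The shift of 13 is an assumed default; non-letters pass through unchanged.
    """
    decoded = []
    for ch in text:
        if ch in string.ascii_lowercase:
            decoded.append(chr((ord(ch) - ord("a") - shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            decoded.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
        else:
            decoded.append(ch)
    return "".join(decoded)


def shift_encode(text: str, shift: int = 13) -> str:
    """Encode by shifting forward, i.e. decoding with the negated shift."""
    return shift_decode(text, -shift)


def exact_match_accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of predictions that equal their target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one example is required")
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical targets: a fluent sentence and a scrambled version of the
# same words, standing in for high- and low-probability outputs.
high_probability = ["The cat sat on the mat."]
low_probability = ["Mat the on sat cat the."]

for targets in (high_probability, low_probability):
    ciphertexts = [shift_encode(t) for t in targets]
    predictions = [shift_decode(c) for c in ciphertexts]
    print(exact_match_accuracy(predictions, targets))
```

Because the decoder applies the same fixed mapping to every letter, it recovers both targets exactly regardless of how plausible the decoded text is. The accuracy gap reported for GPT-4 (51% versus 13%) is notable precisely because the task itself carries no such dependence.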
