In "Embers of Autoregression" (McCoy et al., 2023), we showed that several large language models (LLMs) have some important limitations that are attributable to their origins in next-word prediction. Here we investigate whether these issues persist with o1, a new system from OpenAI that differs from previous LLMs in that it is optimized for reasoning. We find that o1 substantially outperforms previous LLMs in many cases, with particularly large improvements on rare variants of common tasks (e.g., forming acronyms from the second letter of each word in a list, rather than the first letter). Despite these quantitative improvements, however, o1 still displays the same qualitative trends that we observed in previous systems. Specifically, o1, like previous LLMs, is sensitive to the probability of examples and tasks, performing better and requiring fewer "thinking tokens" in high-probability settings than in low-probability ones. These results show that optimizing a language model for reasoning can mitigate, but might not fully overcome, the language model's probability sensitivity.