We investigate the extent to which Large Language Models (LLMs) can simulate the execution of computer code and algorithms. We begin by looking at straight-line programs, and show that current LLMs perform poorly even on such simple programs: performance degrades rapidly as the code grows longer. We then investigate the ability of LLMs to simulate programs that contain critical paths and redundant instructions. We also go beyond straight-line program simulation with sorting algorithms and nested loops, and we show that the computational complexity of a routine directly affects the ability of an LLM to simulate its execution. We observe that LLMs execute instructions sequentially and with a low error margin only for short programs or standard procedures. LLMs' code simulation is in tension with their pattern recognition and memorisation capabilities: for tasks where memorisation is detrimental, we propose a novel prompting method that simulates code execution line by line. Empirically, our new Chain of Simulation (CoSm) method improves on the standard Chain of Thought prompting approach by avoiding the pitfalls of memorisation.
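To make the terms above concrete, the following is a minimal sketch of a straight-line program with a critical path and redundant instructions. It is an illustration only, not the benchmark or the CoSm prompt used in this work: the program text `PROGRAM` and the helpers `ground_truth` and `critical_lines` are hypothetical names introduced here, and the sketch assumes every line has the form `name = expression`.

```python
import ast

# Hypothetical straight-line program: assignments only, no branches or loops.
PROGRAM = """\
a = 3
b = 7
c = a + 2
b = b - 4
c = c + a
"""

def ground_truth(source):
    """Execute the program and return the final value of every variable."""
    env = {}
    exec(source, {}, env)
    return env

def critical_lines(source, query):
    """Return 0-based indices of the lines that influence `query`.

    Walks the program backwards and keeps a line only if it assigns a
    variable that is still needed; all other lines are redundant for `query`.
    Assumes each line is a single assignment to a plain variable name.
    """
    lines = source.strip().splitlines()
    needed = {query}
    keep = []
    for idx in range(len(lines) - 1, -1, -1):
        stmt = ast.parse(lines[idx]).body[0]
        target = stmt.targets[0].id
        if target in needed:
            keep.append(idx)
            needed.discard(target)
            # Variables read on the right-hand side become needed in turn.
            needed |= {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
    return sorted(keep)

if __name__ == "__main__":
    print(ground_truth(PROGRAM))
    print(critical_lines(PROGRAM, "c"))
```

In this sketch the two assignments to `b` never influence `c`, so they play the role of redundant instructions, while the remaining three lines form the critical path for `c`. A simulation query to an LLM could then be scored against the value returned by `ground_truth`.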