We demonstrate that, through appropriate prompting, the GPT-3 family of models can be triggered to perform the iterative behaviours necessary to execute (rather than just write or recall) programs that involve loops, including several popular algorithms found in computer science curricula and software developer interviews. We trigger execution and description of iterations by Iterations by Regimenting Self-Attention (IRSA) in one (or a combination) of three ways: 1) using strong repetitive structure in an example of an execution path of a target program for one particular input, 2) prompting with fragments of execution paths, and 3) explicitly forbidding (skipping) self-attention to parts of the generated text. In dynamic program execution, IRSA leads to larger accuracy gains than replacing the model with the much more powerful GPT-4. IRSA has promising applications in education, as the prompts and responses resemble student assignments in data structures and algorithms classes. Our findings hold implications for the evaluation of LLMs, which typically targets in-context learning: we show that prompts that may not even cover one full task example can trigger algorithmic behaviour, making it possible to solve problems previously thought of as hard for LLMs, such as logical puzzles. Consequently, prompt design plays an even more critical role in LLM performance than previously recognized.
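To make the first technique more concrete, the following is a minimal sketch of how a prompt with strong repetitive structure could be generated from one execution path of a looped program. Bubble sort is used only as an illustrative target; the function names `bubble_sort_trace` and `build_prompt`, and the exact wording of each trace line, are assumptions made for this example and are not the prompts used in the paper. No model is called here; the sketch only builds the text.

```python
def bubble_sort_trace(values):
    """Return (trace_text, sorted_list) for a bubble sort over `values`.

    Every step is written with the same rigid template, so the state and the
    loop-exit condition are restated explicitly on each line. The template
    itself is a hypothetical format chosen for illustration.
    """
    a = list(values)
    lines = [f"Problem: {a}"]
    iteration = 0
    while True:
        iteration += 1
        swap_flag = False
        lines.append(f"Iteration {iteration}: state={a}, swap_flag=false")
        for i in range(len(a) - 1):
            left, right = a[i], a[i + 1]
            greater = left > right
            if greater:
                a[i], a[i + 1] = right, left
                swap_flag = True
            verdict = "swap" if greater else "keep"
            lines.append(
                f"  Pair {i + 1}: compare {left} and {right}; "
                f"{left}>{right} is {str(greater).lower()}; {verdict}; "
                f"state={a}, swap_flag={str(swap_flag).lower()}"
            )
        if not swap_flag:
            lines.append(f"End of iteration {iteration}: swap_flag=false, stop")
            break
        lines.append(f"End of iteration {iteration}: swap_flag=true, repeat")
    lines.append(f"Final: {a}")
    return "\n".join(lines), a


def build_prompt(example, query):
    """Join one worked execution path with a new problem for a model to continue."""
    trace, _ = bubble_sort_trace(example)
    return f"{trace}\n\nProblem: {list(query)}\n"


if __name__ == "__main__":
    print(build_prompt([2, 3, 1, 5], [4, 1, 3]))
```

The resulting text contains a single worked execution path followed by a new `Problem:` line. Under the assumptions above, a model continuing this text would be expected to reproduce the same line-by-line structure for the new input, which is the behaviour the abstract describes as executing rather than recalling the algorithm.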