We analyze the capabilities of Transformer language models in learning compositional discrete tasks. To this end, we evaluate training LLaMA models and prompting GPT-4 and Gemini on four tasks that require learning a composition of several discrete sub-tasks. In particular, we measure how well these models can reuse primitives observable in the sub-tasks to learn the composition task. Our results indicate that compositional learning in state-of-the-art Transformer language models is highly sample-inefficient: LLaMA requires more data samples to learn the compositional task than it would to relearn all sub-tasks from scratch, and in-context prompting with few samples is unreliable, failing to execute the sub-tasks or to correct errors in multi-round code generation. Further, by leveraging complexity theory, we support these findings with a theoretical analysis focused on the sample inefficiency of gradient descent in memorizing feedforward models. We open source our code at https://github.com/IBM/limitations-lm-algorithmic-compositional-learning.