Large language models (LLMs) have emerged as powerful tools for many AI problems and exhibit remarkable in-context learning (ICL) capabilities. Compositional ability, that is, solving unseen complex tasks that combine two or more simple tasks, is an essential reasoning ability for Artificial General Intelligence. Despite LLMs' tremendous success, how they approach composite tasks, especially those not encountered during the pretraining phase, remains an open question and is poorly understood. In this study, we delve into the ICL capabilities of LLMs on composite tasks, with only simple tasks as in-context examples. We develop a test suite of composite tasks that include linguistic and logical challenges and perform empirical studies across different LLM families. We observe that models exhibit divergent behaviors: (1) for simpler composite tasks that apply distinct mapping mechanisms to different input segments, the models demonstrate decent compositional ability, and scaling up the model enhances this ability; (2) for more complex composite tasks that involve multi-step reasoning, where each step represents one task, models typically underperform, and scaling up generally provides no improvement. We offer theoretical analysis in a simplified setting, explaining that models exhibit compositional capability when the task handles different input parts separately. We believe our work sheds new light on the capabilities of LLMs in solving composite tasks regarding the nature of the tasks and model scale. Our dataset and code are available at {\url{https://github.com/OliverXUZY/LLM_Compose}}.
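To make the two kinds of composite task concrete, the following is a minimal sketch under stated assumptions. The toy tasks `upper` and `reverse`, the helper names, and the prompt layout are all hypothetical illustrations; they are not taken from the test suite or the released code. The sketch contrasts a composite task that maps different input segments separately with one that chains two steps, and builds a prompt whose demonstrations contain only the simple tasks.

```python
from typing import List, Tuple

# Hypothetical toy "simple tasks"; NOT the tasks from the paper's test suite.
def upper(word: str) -> str:
    return word.upper()

def reverse(word: str) -> str:
    return word[::-1]

def separable_composite(pair: Tuple[str, str]) -> Tuple[str, str]:
    """Kind (1): each simple task is applied to a different input segment."""
    first, second = pair
    return upper(first), reverse(second)

def sequential_composite(word: str) -> str:
    """Kind (2): multi-step; the second task consumes the first task's output."""
    return reverse(upper(word))

def build_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Format simple-task demonstrations followed by a composite query.

    The 'input:/output:' layout is an assumed format for illustration only.
    """
    blocks = [f"input: {x}\noutput: {y}" for x, y in demos]
    blocks.append(f"input: {query}\noutput:")
    return "\n\n".join(blocks)

# Demonstrations show each simple task on its own; the query is composite.
demos = [("apple", upper("apple")), ("bird", reverse("bird"))]
prompt = build_prompt(demos, "cat dog")
```

In this sketch, a model answering the query would have to combine the two demonstrated tasks without ever seeing them combined, which is the setting the abstract describes.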