Human cognition exhibits systematic compositionality: the algebraic ability to generate infinitely many novel combinations from finite learned components, which is key to understanding and reasoning about complex logic. In this work, we investigate the compositionality of large language models (LLMs) in mathematical reasoning. Specifically, we construct a new dataset, \textsc{MathTrap}\footnotemark[3], by introducing carefully designed logical traps into the problem descriptions of MATH and GSM8k. Since problems with logical flaws are rare in the real world, these represent ``unseen'' cases for LLMs. Solving them requires the models to systematically compose (1) the mathematical knowledge involved in the original problems with (2) the knowledge related to the introduced traps. Our experiments show that while LLMs possess both components of the requisite knowledge, they do not \textbf{spontaneously} combine them to handle these novel cases. We explore several methods to mitigate this deficiency, including natural language prompts, few-shot demonstrations, and fine-tuning, and find that LLMs' performance can be \textbf{passively} improved through such external interventions. Overall, systematic compositionality remains an open challenge for large language models.