Large Language Models (LLMs) have demonstrated strong capabilities as knowledge bases, along with significant in-context reasoning capabilities. However, previous work challenges their out-of-context reasoning ability, i.e., the ability to infer information from their training data rather than from the context or prompt. This paper focuses on a significant facet of out-of-context reasoning: Out-of-Context Knowledge Reasoning (OCKR), which combines multiple pieces of knowledge to infer new knowledge. We design a synthetic dataset with seven representative OCKR tasks to systematically assess the OCKR capabilities of LLMs. Using this dataset, we evaluate the LLaMA2-13B-chat model and find that its proficiency in this aspect is limited, regardless of whether the knowledge is trained in separate or adjacent training settings. Moreover, training the model to reason with complete reasoning data does not result in significant improvement. Training the model to perform explicit knowledge retrieval helps in only one of the tasks, indicating that the model's limited OCKR capabilities are due to difficulties in retrieving relevant knowledge. Furthermore, we treat cross-lingual knowledge transfer as a distinct form of OCKR and evaluate this ability. Our results show that the evaluated model also exhibits limited ability to transfer knowledge across languages. The dataset used in this study is available at https://github.com/NJUNLP/ID-OCKR.