Large language models (LLMs) have made remarkable progress on various natural language processing tasks, owing to their ability to comprehend and reason over factual knowledge. However, a significant amount of factual knowledge is stored in structured data, which possesses unique characteristics that differ from the unstructured text used for pretraining. This difference can introduce imperceptible inference parameter deviations, posing challenges for LLMs in effectively utilizing and reasoning over structured data to accurately infer factual knowledge. To this end, we propose a benchmark named StructFact to evaluate the structural reasoning capabilities of LLMs in inferring factual knowledge. StructFact comprises 8,340 factual questions spanning various tasks, domains, timelines, and regions. This benchmark allows us to investigate the capabilities of LLMs across five factual tasks derived from the unique characteristics of structural facts. Extensive experiments on a set of LLMs with different training strategies reveal the limitations of current LLMs in inferring factual knowledge from structured data. We present this benchmark as a compass for navigating the strengths and weaknesses of LLMs in reasoning over structured data for knowledge-sensitive tasks, and to encourage advances in related real-world applications. Our code is available at https://github.com/EganGu/StructFact.