Large language models (LLMs) have recently revolutionized automated text understanding and generation. Their performance relies on the large number of parameters in the underlying neural architectures, which allows LLMs to memorize part of the vast quantity of data seen during training. This paper investigates whether, and to what extent, general-purpose pre-trained LLMs have memorized information from known ontologies. Our results show that LLMs partially know ontologies: they can, and indeed do, memorize concepts from ontologies mentioned in text, but the degree to which their concepts are memorized seems to vary in proportion to their popularity on the Web, the primary source of LLM training material. We additionally propose new metrics that estimate the degree of memorization of ontological information in LLMs by measuring the consistency of the output produced across different prompt repetitions, query languages, and degrees of determinism.
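The abstract does not define the proposed metrics, so the following is only a minimal sketch of what an output-consistency score could look like, not the authors' method. It assumes that the same query has been sent to a model several times (for example across repetitions, query languages, or temperature settings) and that the answers have been collected as strings. The function names `normalize` and `consistency` are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def consistency(answers: list[str]) -> float:
    """Fraction of answers that agree with the most frequent (normalized) answer.

    This is an illustrative agreement score, assumed here for demonstration;
    it is not the metric proposed in the paper.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(normalize(a) for a in answers)
    # most_common(1) returns a list with one (answer, count) pair
    top_count = counts.most_common(1)[0][1]
    return top_count / len(answers)
```

Under this sketch, a score of 1.0 means every repetition returned the same answer, while lower scores indicate less stable output, which one might read as weaker memorization.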