Large language models (LLMs) are the result of a massive experiment in bottom-up, data-driven reverse engineering of language at scale. Despite their utility in a number of downstream NLP tasks, ample research has shown that LLMs are incapable of reasoning in tasks that require quantifying over and manipulating symbolic variables (e.g., planning and problem solving); see, for example, [25][26]. In this document, however, we focus on testing LLMs for their language understanding capabilities, their supposed forte. As we show here, the language understanding capabilities of LLMs have been widely exaggerated. While LLMs have proven capable of generating human-like coherent language (which is, after all, what they were designed to do), their language understanding capabilities have not been properly tested. In particular, we believe that the language understanding capabilities of LLMs should be tested by performing the opposite of 'text generation': giving the LLM snippets of text as input and then querying what the LLM "understood". As we show here, when this is done it becomes apparent that LLMs do not truly understand language beyond very superficial inferences that are essentially the byproduct of memorizing massive amounts of ingested text.
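To make the proposed test concrete, the following is a minimal sketch of this 'reverse of text generation' probe. The `ask_llm` helper is hypothetical (a stand-in for any chat-completion client, with a canned reply so the sketch runs end to end), and the Winograd-style snippet is an illustrative example of our own choosing, not an item from any specific test suite.

```python
# Minimal sketch of the 'reverse of text generation' test: feed the model a
# snippet and then query what it "understood".

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real model client."""
    return "suitcase"  # canned reply, for illustration only

snippet = "The trophy did not fit in the suitcase because it was too small."
probe = (
    "Read the following text:\n\n"
    f"{snippet}\n\n"
    "What was too small, the trophy or the suitcase? Answer with one word."
)

if __name__ == "__main__":
    answer = ask_llm(probe)
    # Understanding the snippet requires resolving 'it' to the suitcase;
    # a purely surface-level completion may pick either referent.
    print("model answered:", answer)
    print("resolved correctly:", "suitcase" in answer.lower())
```

The point of the sketch is the direction of the interaction: rather than asking the model to continue a text, we hand it a complete snippet and interrogate the interpretation it formed, which is precisely the operation that generation-oriented evaluations leave untested.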