Large language models (LLMs) show remarkable capabilities across a variety of tasks. Although these models see only text during training, several recent studies suggest that LLM representations implicitly capture aspects of the underlying grounded concepts. Here, we explore LLM representations of a particularly salient kind of grounded knowledge: spatial relationships. We design natural-language navigation tasks and evaluate the ability of LLMs, in particular GPT-3.5-turbo, GPT-4, and Llama2 series models, to represent and reason about spatial structures. These tasks reveal substantial variability in LLM performance across different spatial structures, including square, hexagonal, and triangular grids, rings, and trees. In an extensive error analysis, we find that LLMs' mistakes reflect both spatial and non-spatial factors. These findings suggest that LLMs capture certain aspects of spatial structure implicitly, but room for improvement remains.