Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks. In this paper, we show that, beyond text understanding, LLMs are capable of processing text layouts that are denoted by spatial markers. They can answer questions that require explicit spatial perception and reasoning, whereas performance drops drastically when the spatial markers are removed from the original data. We perform a series of experiments with the GPT-3.5, Baichuan2, Llama2, and ChatGLM3 models on various types of layout-sensitive datasets for further analysis. The experimental results reveal that the layout understanding ability of LLMs is mainly introduced by the code data used in pretraining and is further enhanced at the instruction-tuning stage. In addition, layout understanding can be enhanced by integrating low-cost, auto-generated data produced through a novel text-game approach. Finally, we show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.