Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks. In this paper, we show that, beyond text understanding capability, LLMs are capable of processing text layouts that are denoted by spatial markers. They are able to answer questions that require explicit spatial perception and reasoning, while a drastic performance drop is observed when the spatial markers are excluded from the original data. We perform a series of experiments with the GPT-3.5, Baichuan2, Llama2, and ChatGLM3 models on various types of layout-sensitive datasets for further analysis. The experimental results reveal that the layout understanding ability of LLMs is mainly introduced by the coding data used during pretraining and is further enhanced at the instruction-tuning stage. In addition, layout understanding can be improved by integrating low-cost, automatically generated data produced through a novel text-game approach. Finally, we show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.