Assessing the capabilities of large language models (LLMs) is often challenging, in part because it is hard to find tasks to which they have not been exposed during training. We take one step toward addressing this challenge by turning to a new task: symbolic graphics programs, a popular representation for graphics content that procedurally generates visual data. LLMs have shown exciting promise in program synthesis, but do they understand symbolic graphics programs? Unlike conventional programs, symbolic graphics programs can be translated into graphics content. Here, we characterize an LLM's understanding of symbolic programs in terms of its ability to answer questions about the corresponding graphics content. This task is challenging because the questions are difficult to answer from the symbolic programs alone, yet easy to answer from the corresponding graphics content, as we verify through a human experiment. To understand symbolic programs, LLMs may therefore need the ability to imagine how the corresponding graphics content would look without directly accessing the rendered visual content. We use this task to evaluate LLMs by creating a large benchmark for the semantic understanding of symbolic graphics programs. The benchmark is built via program-graphics correspondence and hence requires minimal human effort. We evaluate current LLMs on the benchmark to provide a preliminary assessment of their ability to reason about visual scenes from programs. We find that this task distinguishes existing LLMs, and that models considered good at reasoning perform better. Lastly, we introduce Symbolic Instruction Tuning (SIT) to improve this ability. Specifically, we query GPT-4o with questions and images generated by symbolic programs, and the resulting data are then used to finetune an LLM. We also find that SIT data improves the general instruction-following ability of LLMs.
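To make the task concrete, here is a toy sketch of the idea (the DSL, shape primitives, and question format below are illustrative assumptions, not the benchmark's actual representation): a symbolic graphics program is a list of shape commands, and a semantic question about the rendered scene can only be answered by mentally reconstructing that scene from the program.

```python
# A hypothetical symbolic graphics program: each command is
# (shape, x, y, size), describing a primitive in the rendered scene.
program = [
    ("rect",   0, 0, 4),   # a square centered at the origin
    ("circle", 0, 3, 1),   # a circle above it
    ("circle", 2, 3, 1),   # another circle above it
]

def count_shapes_above(scene, shape, ref_shape):
    """Answer 'how many <shape>s sit above the <ref_shape>?' by
    interpreting the program, i.e. imagining the rendered scene."""
    ref_y = next(y for s, x, y, sz in scene if s == ref_shape)
    return sum(1 for s, x, y, sz in scene if s == shape and y > ref_y)

# Trivial from the rendered image, but from the program alone it
# requires reasoning about positions encoded in the symbols.
print(count_shapes_above(program, "circle", "rect"))  # 2
```

The question is easy for a human looking at the picture; answering it from the program text alone is exactly the kind of "imagining without rendering" the benchmark probes.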