Against the backdrop of enthusiasm for large language models (LLMs), there is an urgent need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks which the models have not encountered during training. Utilizing symbolic graphics programs, we propose a domain well-suited to test multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM's ability to answer semantic questions, at different levels of granularity, about the images or 3D geometries without a vision encoder. To semantically understand the symbolic programs, LLMs would need to possess the ability to "imagine" and reason about how the corresponding graphics content would look from the symbolic description alone. We use this task to evaluate LLMs by creating a large benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. Particular emphasis is placed on transformations of images that leave the image-level semantics invariant while introducing significant changes to the underlying program. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about the visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability: Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves an LLM's understanding of symbolic programs, but also improves its general reasoning ability on various other benchmarks.
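To make the task concrete, the following is an illustrative sketch (not the paper's actual benchmark code) of what a symbolic graphics program and its semantic questions look like: a small procedural generator emits SVG source for a simple scene, and an LLM given only the program text, with no rendered image and no vision encoder, would be asked semantic questions about the picture the program produces. The scene contents and the `scene_svg` helper are hypothetical examples chosen for illustration.

```python
# Illustrative sketch: a symbolic graphics program is source code whose
# execution produces an image. An LLM sees only the code below and must
# answer questions about the rendered result.

def scene_svg(sun_x: int = 60, sun_y: int = 40) -> str:
    """Procedurally build an SVG scene: a sun above a house."""
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">',
        # Sun: a yellow circle near the top of the canvas.
        f'<circle cx="{sun_x}" cy="{sun_y}" r="20" fill="gold"/>',
        # House body: a brown rectangle in the lower half.
        '<rect x="70" y="110" width="60" height="60" fill="saddlebrown"/>',
        # Roof: a red triangle sitting on the body.
        '<polygon points="60,110 140,110 100,70" fill="firebrick"/>',
        '</svg>',
    ]
    return "\n".join(parts)

program = scene_svg()
# A semantic question answerable only by "imagining" the rendered output:
#   Q: "Is the circle above the rectangle?"  (requires spatial reasoning
#   over coordinates, not just pattern matching on the source text)

# A semantics-preserving transformation: moving the sun horizontally
# changes the program text substantially but leaves the image-level
# meaning ("a sun above a house") unchanged.
variant = scene_svg(sun_x=140)
```

This mirrors the benchmark's emphasis on transformations that alter the underlying program while keeping the image-level semantics fixed: a model that truly understands the program should give the same semantic answers for `program` and `variant`.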