Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it remains unclear how well current VLMs perform on education-oriented visual programming and what factors limit their performance. To bridge this gap, we introduce TurtleAI, a benchmark containing 823 tasks curated based on real-world visual programming tasks in the Turtle Graphics domain. Solving these tasks requires models to perceive geometric patterns, reason about spatial relationships, and synthesize Python code that faithfully reproduces geometric patterns. We evaluate 20+ VLMs, including GPT-5, GPT-4o, and Qwen2-VL-72B, and find that they struggle significantly, with most achieving success rates below 30%. To address these limitations, we propose a data generation technique that requires only a small set of seed samples. Fine-tuning Qwen2-VL-72B on the resulting synthetic data yields an improvement of about 20% on real-world tasks. Our failure analysis reveals that GPT-4o struggles with spatial reasoning and precise visual replication, whereas fine-tuning primarily improves the alignment between visual reasoning and code implementation.
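To make the task format concrete, the following is a minimal sketch of the kind of program a model might be asked to synthesize for a simple geometric pattern (here, a square). It is an illustration only and is not taken from the TurtleAI benchmark: the `TraceTurtle` class and the `draw_regular_polygon` helper are hypothetical names, and the headless tracer is an assumption introduced so the example runs without a display. It mirrors the `forward`, `left`, and `right` methods of Python's standard `turtle.Turtle`.

```python
import math


class TraceTurtle:
    """Hypothetical headless stand-in for turtle.Turtle.

    It records the vertices it visits instead of drawing on a screen.
    """

    def __init__(self):
        self.x, self.y = 0.0, 0.0
        self.heading = 0.0  # degrees; 0 points east, counterclockwise is positive
        self.path = [(0.0, 0.0)]

    def forward(self, distance):
        # Move along the current heading and record the new vertex.
        rad = math.radians(self.heading)
        self.x += distance * math.cos(rad)
        self.y += distance * math.sin(rad)
        self.path.append((round(self.x, 6), round(self.y, 6)))

    def left(self, angle):
        # Turn counterclockwise by `angle` degrees.
        self.heading = (self.heading + angle) % 360

    def right(self, angle):
        # Turn clockwise by `angle` degrees.
        self.left(-angle)


def draw_regular_polygon(t, sides, length):
    """Draw a regular polygon by repeating one edge and one exterior-angle turn."""
    for _ in range(sides):
        t.forward(length)
        t.left(360 / sides)


# Reproduce a square pattern: four edges of length 100.
t = TraceTurtle()
draw_regular_polygon(t, sides=4, length=100)
square_path = t.path
```

Because `draw_regular_polygon` only calls `forward` and `left`, the same function should also work with a real `turtle.Turtle()` instance when a display is available. How the benchmark itself judges whether a generated drawing matches the target is not specified in this abstract, so no scoring logic is sketched here.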