3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets, properties that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents on procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. Our extensive evaluations show that (1) failures mostly arise from API mismatches, while outputs that render successfully still suffer from disconnected or floating geometric components, and (2) test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including a curated large-scale dataset of triplets (multimodal text/image prompt, procedural code, 3D object), the evaluation protocol, and the public 3DCodeArena platform, as a foundational toolkit for exploring VLM-based procedural 3D modelers.
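To make the idea of ranking from pairwise human preferences concrete, the following is a minimal sketch of one common way to turn such votes into a leaderboard. The abstract does not state how 3DCodeArena aggregates preferences, so the Elo-style update, the constants `k` and `base`, and the names `elo_ratings` and `leaderboard` are all assumptions made for illustration rather than the platform's actual method.

```python
def elo_ratings(votes, k=32.0, base=1000.0):
    """Aggregate pairwise preferences into Elo-style ratings.

    votes: iterable of (winner, loser) model-name pairs, one per human vote.
    k, base: illustrative constants (assumed; not taken from the paper).
    """
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, base)
        r_l = ratings.setdefault(loser, base)
        # Probability the winner was expected to win under the current ratings.
        expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
        # Upsets (low expected_w) move the ratings more than expected wins.
        delta = k * (1.0 - expected_w)
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


def leaderboard(votes):
    """Return model names sorted from highest to lowest rating."""
    ratings = elo_ratings(votes)
    return sorted(ratings, key=ratings.get, reverse=True)
```

Calling `leaderboard` on a list of `(winner, loser)` votes returns model names from most to least preferred. Sequential Elo depends on the order of votes, so a platform might instead fit a Bradley-Terry model over all votes at once; that choice is likewise left open here.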
