P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning

Multimodal large language models (MLLMs) can write programs, including programs that construct 3D models, which opens a new avenue for 3D generation driven by their priors, world knowledge and reasoning. Yet existing benchmarks rarely evaluate 3D modeling through code. Such modeling demands more than runnable code: given a text or visual specification, a model must generate a parametric 3D program that is geometrically precise, semantically aligned and assembly-consistent. We introduce P3D-Bench, a benchmark for parametric 3D generation. Unlike a 3D mesh, a parametric 3D program exposes explicit dimensions, construction operations and part relations, revealing whether a model recovers a design's structure and not just its appearance. Under a unified protocol, P3D-Bench covers three task families (Text-to-3D, Image-to-3D and Assembly-3D) and scores each output on executability, geometric fidelity, topology, text-grounded constraints, multiview semantic alignment and part-level structure. We evaluate frontier MLLMs and text-only LLMs on 400 text cases, 400 image cases and 203 annotated assemblies, with domain-specific models as reference points. The evaluation yields three findings. First, assemblies are the hardest setting: models still fail to compose multiple parts into a coherent structure. Second, models often recover the global shape and semantic identity of the target object, yet fail to reproduce the precise parametric geometry specified by the input. Third, part-level modeling remains weak on assemblies, where models recover neither the geometry of each part nor the correct number of parts. These results position P3D-Bench as a benchmark for evaluating precise geometry and part-level structure in parametric 3D generation.
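
To make the contrast with a mesh concrete, the following is a minimal sketch of what a parametric 3D program could look like. It is an illustration only: the abstract does not specify the program format or the metrics P3D-Bench uses, so every name here (`Box`, `ThroughHole`, `Part`, `mounting_plate`, `relative_volume_error`, `part_count_matches`) is hypothetical, and the volume comparison is a stand-in for a geometric check, not the benchmark's metric.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical construction operation: a rectangular solid (mm)."""
    length: float
    width: float
    height: float

    def volume(self) -> float:
        return self.length * self.width * self.height


@dataclass(frozen=True)
class ThroughHole:
    """Hypothetical construction operation: a cylindrical cut of given depth."""
    diameter: float
    depth: float

    def volume(self) -> float:
        return math.pi * (self.diameter / 2) ** 2 * self.depth


@dataclass(frozen=True)
class Part:
    """A named part: one base solid minus a tuple of holes.

    Assumption: holes lie fully inside the base and do not overlap,
    so their volumes can simply be subtracted.
    """
    name: str
    base: Box
    holes: tuple = ()

    def volume(self) -> float:
        return self.base.volume() - sum(h.volume() for h in self.holes)


def mounting_plate(length=80.0, width=40.0, thickness=5.0,
                   hole_diameter=6.0, n_holes=4) -> Part:
    """Explicit dimensions are ordinary parameters, unlike vertices in a mesh."""
    holes = tuple(ThroughHole(hole_diameter, thickness) for _ in range(n_holes))
    return Part("plate", Box(length, width, thickness), holes)


def relative_volume_error(pred: Part, target: Part) -> float:
    """Stand-in geometric check: relative difference in solid volume."""
    return abs(pred.volume() - target.volume()) / target.volume()


def part_count_matches(pred_parts: list, target_parts: list) -> bool:
    """Stand-in part-level check: does the assembly have the right number of parts?"""
    return len(pred_parts) == len(target_parts)


if __name__ == "__main__":
    target = mounting_plate()
    pred = mounting_plate(hole_diameter=8.0)  # right shape, wrong dimension
    print(f"relative volume error: {relative_volume_error(pred, target):.4f}")
    print("part count matches:", part_count_matches([pred], [target, target]))
```

In this sketch a prediction with the right overall shape but a wrong hole diameter still yields a nonzero error, and an assembly with a missing part fails the count check. This mirrors, in a simplified form, the gap between global shape and precise parametric geometry described above.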
