Autoregressive multimodal large language models (MLLMs) enable 3D generation but struggle to scale to high-resolution shapes because existing 3D tokenizations are inadequate. Compact set-based representations discard deterministic spatial ordering, which makes sequence prediction ambiguous, while uniform or octree-based voxel grids preserve ordering at the cost of severe redundancy and excessively long sequences. This structural trade-off limits stable and efficient autoregressive 3D generation. We present SuperVoxelGPT, a representation-first framework that resolves this tension through adaptive, deterministically ordered supervoxel tokenization. Given a prompt, we first predict a coarse geometric saliency distribution and construct a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, allocating fine-grained cells to complex regions and larger cells to smooth regions. We then introduce a SuperVoxelVAE and fine-tune a pretrained MLLM to autoregressively generate supervoxel tokens, conditioned on the text and the ordered supervoxel layout. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of that required by uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10$\times$ speedup over prior methods.
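To make the partitioning idea concrete, the following is a minimal sketch of a saliency-weighted centroidal Voronoi tessellation over voxel centers, approximated with Lloyd relaxation, followed by one possible deterministic ordering of the resulting cells. It is an illustration under stated assumptions, not the SuperVoxelGPT implementation: the function names `saliency_cvt` and `order_cells`, the seeding strategy, the iteration count, and the (z, y, x) lexicographic ordering are all hypothetical choices, since the abstract does not specify them.

```python
import numpy as np


def saliency_cvt(points, saliency, n_cells, n_iters=20, seed=0):
    """Weighted Lloyd relaxation, a discrete approximation to a CVT.

    points:   (N, 3) array of voxel centers.
    saliency: (N,) strictly positive weights; higher values attract more cells.
    Returns (centers, labels): centers is (n_cells, 3), labels is (N,).
    """
    rng = np.random.default_rng(seed)
    prob = saliency / saliency.sum()
    # Assumption: seed the sites by sampling voxels in proportion to saliency.
    idx = rng.choice(len(points), size=n_cells, replace=False, p=prob)
    centers = points[idx].astype(float)

    def assign(c):
        # Nearest-site assignment gives the discrete Voronoi cells.
        d2 = ((points[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)

    for _ in range(n_iters):
        labels = assign(centers)
        # Move each site to the saliency-weighted centroid of its cell.
        for k in range(n_cells):
            mask = labels == k
            w = saliency[mask]
            if w.sum() > 0:  # an empty cell keeps its previous site
                centers[k] = (points[mask] * w[:, None]).sum(axis=0) / w.sum()
    return centers, assign(centers)


def order_cells(centers, labels):
    """Relabel cells in a fixed order (assumed here: sort by z, then y, then x)."""
    # np.lexsort treats the LAST key as the primary sort key.
    order = np.lexsort((centers[:, 0], centers[:, 1], centers[:, 2]))
    rank = np.empty(len(centers), dtype=int)
    rank[order] = np.arange(len(centers))
    return centers[order], rank[labels]


# Usage: an 8x8x8 grid whose saliency peaks near the origin corner.
axis = np.arange(8)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
grid = grid.reshape(-1, 3).astype(float)
weights = 1.0 + 10.0 * np.exp(-np.linalg.norm(grid, axis=1))
sites, cell_ids = order_cells(*saliency_cvt(grid, weights, n_cells=16))
```

Because the sites are drawn toward high-saliency voxels, cells in those regions tend to be smaller, which mirrors the allocation of fine-grained cells to complex regions described above. Fixing the random seed and the sort key makes the token order reproducible for a given saliency field, which is the property a sequence model needs from the layout.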