Autoregressive multimodal large language models (MLLMs) enable 3D generation but struggle to scale to high-resolution shapes due to inadequate 3D tokenizations. Compact set-based representations discard deterministic spatial ordering, leading to ambiguous sequence prediction, while uniform or octree-based voxel grids preserve ordering at the cost of severe redundancy and excessively long sequences. This structural trade-off limits stable and efficient autoregressive 3D generation. We present SuperVoxelGPT, a representation-first framework that resolves this tension through adaptive and deterministically ordered supervoxel tokenization. Given a prompt, we first predict a coarse geometric saliency distribution and construct a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, allocating fine-grained cells to complex regions and larger cells to smooth regions. Conditioned on the text and ordered supervoxel layout, we introduce a SuperVoxelVAE and fine-tune a pretrained MLLM to autoregressively generate supervoxel tokens. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10$\times$ speedup over prior methods.
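The saliency-guided centroidal Voronoi tessellation step can be illustrated with a small sketch. The abstract does not specify the algorithm's details, so everything below is an assumption made for illustration: it uses weighted Lloyd relaxation on a discrete voxel grid, the function names `saliency_cvt` and `order_sites` are hypothetical, and the lexicographic sort of cell centroids is only one possible way to obtain a deterministic supervoxel order, not necessarily the one used by SuperVoxelGPT.

```python
import numpy as np


def saliency_cvt(saliency, n_sites, n_iters=20, seed=0):
    """Weighted Lloyd relaxation on a voxel grid (illustrative sketch only).

    saliency: non-negative array of shape (D, H, W).
    Returns (sites, labels): site coordinates of shape (n_sites, 3) and an
    integer cell id per voxel with the same shape as `saliency`.
    """
    rng = np.random.default_rng(seed)
    shape = saliency.shape
    grid = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    coords = np.stack(grid, axis=-1).reshape(-1, 3).astype(float)
    # Small offset keeps every voxel selectable and avoids zero-weight cells.
    w = saliency.reshape(-1).astype(float) + 1e-8

    # Assumption: seed the sites by sampling voxels proportionally to saliency.
    idx = rng.choice(len(coords), size=n_sites, replace=False, p=w / w.sum())
    sites = coords[idx].copy()

    def assign(current_sites):
        # Nearest-site assignment defines the discrete Voronoi cells.
        d2 = ((coords[:, None, :] - current_sites[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)

    for _ in range(n_iters):
        labels = assign(sites)
        wsum = np.bincount(labels, weights=w, minlength=n_sites)
        nonempty = wsum > 0
        for k in range(3):
            num = np.bincount(labels, weights=w * coords[:, k], minlength=n_sites)
            # Move each site to the saliency-weighted centroid of its cell.
            sites[nonempty, k] = num[nonempty] / wsum[nonempty]

    return sites, assign(sites).reshape(shape)


def order_sites(sites):
    """Deterministic order: sort by axis 0, then axis 1, then axis 2 (assumed rule)."""
    keys = np.round(sites, 6)
    # np.lexsort treats the LAST key as the primary sort key.
    return np.lexsort((keys[:, 2], keys[:, 1], keys[:, 0]))


if __name__ == "__main__":
    # Toy saliency: one half of the grid is ten times more salient.
    sal = np.ones((16, 16, 16))
    sal[:8] = 10.0
    sites, labels = saliency_cvt(sal, n_sites=32)
    order = order_sites(sites)
    print("cells in salient half:", int((sites[:, 0] < 8).sum()), "of", len(sites))
    print("first ordered site:", sites[order[0]])
```

In this toy setup the more salient half of the grid attracts more sites and therefore smaller cells, mirroring the idea of allocating fine-grained cells to complex regions and larger cells to smooth ones. Fixing the random seed and the sort rule makes the resulting token order reproducible for a given saliency field.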