Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that generates 3D Gaussians directly via next-token prediction, enabling full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled by a causal transformer with 3D rotary positional embeddings, which enables sequential generation of spatial structure and appearance. Unlike diffusion-based methods, which refine an entire scene holistically, our formulation constructs scenes step by step and therefore naturally supports completion, outpainting, temperature-controlled sampling, and flexible generation horizons. It leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable, context-aware 3D generation.
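As a rough illustration of three ingredients named in the abstract (vector quantization, next-token generation from a prefix, and temperature-controlled sampling), the following minimal Python sketch may help. It is not the GaussianGPT implementation: the function names, the tiny codebook, and the `next_logits` callable standing in for the causal transformer are all hypothetical, and the sparse 3D convolutional autoencoder, token serialization, and 3D rotary positional embeddings are omitted entirely.

```python
import math
import random


def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector` (squared L2).

    This is the basic lookup step of vector quantization; the codebook here
    is a plain list of lists, assumed purely for illustration.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(next_logits, prefix, num_new_tokens, temperature, rng):
    """Extend `prefix` one token at a time.

    `next_logits` is a hypothetical stand-in for a causal model: it takes the
    tokens generated so far and returns logits over the vocabulary.
    """
    tokens = list(prefix)
    for _ in range(num_new_tokens):
        tokens.append(sample_token(next_logits(tokens), temperature, rng))
    return tokens
```

In this sketch, a non-empty `prefix` plays the role of known scene context to be completed or extended, `num_new_tokens` corresponds to the generation horizon, and `temperature` trades diversity against determinism. A real system would replace `next_logits` with a trained transformer and decode the sampled tokens back into Gaussians.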