Recent advances in auto-regressive transformers have achieved remarkable success in generative modeling. However, text-to-3D generation remains challenging, primarily due to bottlenecks in learning discrete 3D representations. Specifically, existing approaches often suffer from information loss during encoding, introducing representational distortion before quantization. This distortion is further amplified by vector quantization, ultimately degrading the geometric coherence of text-conditioned 3D shapes. Moreover, the conventional two-stage training paradigm induces an objective mismatch between reconstruction and text-conditioned auto-regressive generation. To address these issues, we propose View-aware Auto-Regressive 3D (VAR-3D), which integrates a view-aware 3D Vector Quantized-Variational AutoEncoder (VQ-VAE) to convert the complex geometric structure of 3D models into discrete tokens. Additionally, we introduce a rendering-supervised training strategy that couples discrete token prediction with visual reconstruction, encouraging the generative process to preserve visual fidelity and structural consistency with the input text. Experiments demonstrate that VAR-3D significantly outperforms existing methods in both generation quality and text-3D alignment.
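To make the tokenization step concrete, the sketch below shows a minimal vector-quantization layer of the kind a 3D VQ-VAE relies on. The codebook size, embedding dimension, commitment weight, and class name are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: maps continuous encoder features to discrete
    codebook indices, i.e. the kind of 3D tokens a VQ-VAE produces.
    num_codes, dim, and beta are illustrative defaults, not paper values."""
    def __init__(self, num_codes=8192, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z):
        # z: (B, N, dim) continuous features from the 3D encoder.
        flat = z.reshape(-1, z.size(-1))
        # Nearest codebook entry per feature; this hard assignment is where
        # encoding error gets amplified, as the abstract notes.
        dist = torch.cdist(flat, self.codebook.weight)   # (B*N, num_codes)
        idx = dist.argmin(dim=-1).view(z.shape[:-1])     # discrete 3D tokens
        z_q = self.codebook(idx)                         # quantized features
        # Codebook loss + commitment loss, as in standard VQ-VAE training.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: copy gradients past the argmax.
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss
```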
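For the rendering-supervised strategy, the following is a hedged sketch of what coupling discrete token prediction with visual reconstruction could look like. `transformer`, `decoder`, `renderer`, and `lambda_render` are hypothetical stand-ins (the abstract does not expose these names), and the straight-through Gumbel-softmax relaxation is one standard way to keep the rendering loss differentiable; the paper's actual coupling may differ.

```python
import torch.nn.functional as F

def joint_training_step(transformer, decoder, renderer,
                        text_emb, gt_tokens, gt_views, lambda_render=0.5):
    """One step under a joint objective: auto-regressive token loss plus a
    pixel loss on rendered views. All module names and the loss weight are
    illustrative assumptions, not the paper's API."""
    # Next-token prediction conditioned on the text embedding.
    logits = transformer(text_emb, gt_tokens[:, :-1])      # (B, T-1, vocab)
    ar_loss = F.cross_entropy(logits.transpose(1, 2), gt_tokens[:, 1:])
    # Straight-through Gumbel-softmax keeps token selection differentiable,
    # so the rendering loss can reach the transformer's parameters.
    soft_tokens = F.gumbel_softmax(logits, tau=1.0, hard=True)
    shape = decoder(soft_tokens)   # hypothetical decoder over one-hot tokens
    pred_views = renderer(shape)   # render the predicted shape to images
    render_loss = F.mse_loss(pred_views, gt_views)
    # Summing the two losses ties generation to visual reconstruction,
    # addressing the two-stage objective mismatch described above.
    return ar_loss + lambda_render * render_loss
```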