DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation

Recent Diffusion Transformers (e.g., DiT) have proven highly effective at generating high-quality 2D images. However, it remains unclear whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods have mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, namely DiT-3D, which performs the denoising process directly on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, DiT-3D is more scalable in model size and produces substantially higher-quality generations. Specifically, DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. Because the additional voxel dimension increases the token length, and with it the cost of self-attention, we incorporate 3D window attention into the Transformer blocks to reduce computation. Finally, linear and devoxelization layers are used to predict the denoised point clouds. In addition, our Transformer architecture supports efficient fine-tuning from 2D to 3D: a DiT-2D checkpoint pre-trained on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, relative to the previous state-of-the-art method, DiT-3D lowers the 1-Nearest Neighbor Accuracy by 4.59 and raises the Coverage metric by 3.51 when evaluated on Chamfer Distance.
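
To make the token-length argument concrete, the following is a minimal NumPy sketch of voxelization, 3D patchification, and 3D window partitioning. It is an illustration under stated assumptions, not the DiT-3D implementation: the grid size (32), patch size (4), window size (4), and the helper names `voxelize`, `patchify_3d`, and `window_partition_3d` are hypothetical choices made for this example, and the learned patch projection, positional embeddings, and the attention computation itself are omitted.

```python
import numpy as np

def voxelize(points, grid):
    """Map an (N, 3) point cloud in [-1, 1]^3 to a (grid, grid, grid) occupancy volume."""
    idx = np.clip(((points + 1.0) / 2.0 * grid).astype(int), 0, grid - 1)
    vol = np.zeros((grid, grid, grid), dtype=np.float32)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vol

def patchify_3d(vol, patch):
    """Split a cubic volume into non-overlapping patch^3 blocks, one token per block."""
    g = vol.shape[0] // patch  # number of patches along each axis
    blocks = vol.reshape(g, patch, g, patch, g, patch)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)  # patch-grid axes first
    return blocks.reshape(g ** 3, patch ** 3)

def window_partition_3d(tokens, side, window):
    """Group a (side^3, C) token sequence into (num_windows, window^3, C)."""
    c = tokens.shape[-1]
    w = side // window  # number of windows along each axis
    x = tokens.reshape(w, window, w, window, w, window, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # window-grid axes first
    return x.reshape(w ** 3, window ** 3, c)

# Assumed sizes for illustration only (not taken from the text above).
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(2048, 3))
vol = voxelize(points, grid=32)
tokens = patchify_3d(vol, patch=4)                       # (512, 64)
windows = window_partition_3d(tokens, side=8, window=4)  # (8, 64, 64)

# Number of query-key pairs scored by attention in each setting.
global_pairs = tokens.shape[0] ** 2                      # all tokens attend to all tokens
window_pairs = windows.shape[0] * windows.shape[1] ** 2  # attention restricted to each window
print(global_pairs, window_pairs)
```

With these assumed sizes, a 32^3 grid yields 8^3 = 512 tokens, so global self-attention scores 512^2 pairs, while restricting attention to 4^3-token windows scores 8 x 64^2 pairs, an 8-fold reduction. The same patch size on a 2D 32 x 32 image would give only 64 tokens, which is the growth in token length that motivates windowed attention in 3D.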
