Visual generation with discrete tokens has gained significant attention because it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8 to 32 dimensions), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768 to 1024 dimensions) could bridge this gap, generating them in discrete form poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation: any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, while the number of generation steps stays fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$ for an $h \times w$ grid of $d$-dimensional tokens. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve the capabilities of the original representation, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
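To make the masking and step-count claims concrete, the following is a minimal sketch, not the released implementation. It assumes an $h \times w \times d$ array of discrete codes, a hypothetical `MASK` sentinel, a random stand-in predictor (`toy_predictor`) in place of the trained model, and a cosine unmasking schedule borrowed from common masked-generation practice; none of these names or choices are taken from the paper. It illustrates only that every (row, column, dimension) entry can be masked individually and that generation finishes in $T$ steps even though there are $hwd$ entries.

```python
import math
import numpy as np

MASK = -1  # hypothetical sentinel marking a masked entry (assumption)


def random_cubic_mask(tokens, mask_ratio, rng):
    """Mask individual (row, col, dim) entries, not whole spatial positions."""
    flat = tokens.reshape(-1).copy()
    n_mask = int(round(mask_ratio * flat.size))
    idx = rng.choice(flat.size, size=n_mask, replace=False)
    flat[idx] = MASK
    return flat.reshape(tokens.shape)


def toy_predictor(partial, vocab_size, rng):
    """Stand-in for the trained model: random codes with random confidences."""
    guess = rng.integers(0, vocab_size, size=partial.shape)
    conf = rng.random(partial.shape)
    return guess, conf


def generate(h, w, d, vocab_size, T, rng):
    """Fill an h*w*d cube in exactly T steps, revealing many entries per step."""
    x = np.full(h * w * d, MASK, dtype=np.int64)  # start fully masked
    total = x.size
    for t in range(1, T + 1):
        # Assumed cosine schedule: entries that stay masked after step t.
        keep_masked = 0 if t == T else int(total * math.cos(math.pi / 2 * t / T))
        guess, conf = toy_predictor(x, vocab_size, rng)
        masked = x == MASK
        n_reveal = int(masked.sum()) - keep_masked
        if n_reveal <= 0:
            continue
        # Only masked entries compete; reveal the most confident ones.
        conf = np.where(masked, conf, -np.inf)
        order = np.argsort(conf)[::-1][:n_reveal]
        x[order] = guess[order]
    return x.reshape(h, w, d)


rng = np.random.default_rng(0)
sample = generate(h=4, w=4, d=8, vocab_size=16, T=6, rng=rng)
partially_masked = random_cubic_mask(sample, mask_ratio=0.5, rng=rng)
```

Here the cube has 128 entries but is completed in 6 steps, which is the sense in which $T \ll hwd$. A real model would replace `toy_predictor` with a network conditioned on the unmasked entries.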