In this study, we explore Transformer-based diffusion models for image and video generation. Although Transformer architectures dominate many fields thanks to their flexibility and scalability, the visual generative domain still relies primarily on CNN-based U-Net architectures, particularly in diffusion-based models. To address this gap, we introduce GenTron, a family of generative models that employ Transformer-based diffusion. Our first step is to adapt Diffusion Transformers (DiTs) from class conditioning to text conditioning, which involves a thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters and observe significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating a novel motion-free guidance technique to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate) and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels on T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.