Image synthesis has advanced remarkably in recent years, enabling diverse applications in content creation, virtual reality, and beyond. We introduce a novel approach to image generation based on autoregressive (AR) modeling, which uses a next-detail prediction strategy for enhanced fidelity and scalability. While AR models have achieved transformative success in language modeling, replicating this success in vision tasks has proven difficult because of the inherent spatial dependencies in images. Our method addresses this difficulty by building an image compositionally: it iteratively adds finer details, constructing the image as a hierarchical combination of base and detail image factors. This strategy proves more effective than conventional next-token prediction and even surpasses state-of-the-art next-scale prediction approaches. A key advantage of the method is that it scales to higher resolutions without requiring full model retraining, making it a versatile solution for high-resolution image generation.
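To make the idea of a "hierarchical combination of base and detail image factors" concrete, the following is a minimal sketch and not the method described above. It assumes an additive combination and uses a simple box blur to define the factors; the abstract specifies neither. The function names (`box_blur`, `decompose`, `compose`) and the `radii` parameter are hypothetical. In an actual next-detail model, each detail factor would be predicted autoregressively from the earlier stages; here the factors are computed from a known image purely to illustrate the coarse-to-fine composition.

```python
import numpy as np


def box_blur(img, radius):
    """Blur a 2-D array with a (2*radius+1)-wide box filter, edge-padded."""
    if radius == 0:
        return img.copy()
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    padded = np.pad(img, radius, mode="edge")
    # Separable filter: average along each row, then along each column.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded
    )
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows
    )


def decompose(img, radii=(8, 4, 2, 1)):
    """Split img into a coarse base and progressively finer detail factors.

    Assumption: factors combine additively (img = base + sum(details)).
    """
    levels = [box_blur(img, r) for r in radii] + [img]
    base = levels[0]
    # Each detail factor is what the next-finer level adds to the current one.
    details = [fine - coarse for coarse, fine in zip(levels, levels[1:])]
    return base, details


def compose(base, details):
    """Rebuild the image by adding one detail factor at a time, coarse to fine."""
    current = base.copy()
    stages = [current.copy()]
    for detail in details:
        # A generative model would predict `detail` here instead of reading it.
        current = current + detail
        stages.append(current)
    return stages


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((16, 16))
    base, details = decompose(image)
    stages = compose(base, details)
    print(len(stages), np.allclose(stages[-1], image))
```

Because the detail factors telescope, the final stage reproduces the input up to floating-point rounding, and every intermediate stage is a valid, progressively sharper image at the full resolution.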