Masked Generative Image Transformers (MaskGIT) have emerged as a scalable and efficient image generation framework, able to deliver high-quality visuals at low inference cost. However, MaskGIT's token unmasking scheduler, an essential component of the framework, has not received the attention it deserves. We analyze the sampling objective in MaskGIT, based on the mutual information between tokens, and elucidate its shortcomings. We then propose a new sampling strategy based on our Halton scheduler instead of the original Confidence scheduler. More precisely, our method selects token positions according to a quasi-random, low-discrepancy Halton sequence. Intuitively, this spreads the selected tokens spatially, progressively covering the image uniformly at each step. Our analysis shows that it reduces non-recoverable sampling errors, leading to simpler hyper-parameter tuning and better-quality images. Our scheduler requires no retraining or noise injection and can serve as a simple drop-in replacement for the original sampling strategy. Evaluation on both class-to-image synthesis on ImageNet and text-to-image generation on the COCO dataset demonstrates that the Halton scheduler outperforms the Confidence scheduler quantitatively, by reducing the FID, and qualitatively, by generating more diverse and more detailed images. Our code is at https://github.com/valeoai/Halton-MaskGIT.
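As a rough illustration of the idea, the following sketch orders the positions of a token grid by a 2-D Halton sequence (bases 2 and 3), keeping the first index that lands in each cell. This is only a minimal illustration of low-discrepancy position selection, not the paper's implementation; the grid size and the deduplication scheme are assumptions for the example.

```python
def halton(index: int, base: int) -> float:
    """Return the `index`-th element of the 1-D Halton sequence in `base`
    (the radical inverse of `index`)."""
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result


def halton_token_order(grid: int = 16) -> list[tuple[int, int]]:
    """Order all grid*grid token positions by a 2-D Halton sequence
    (bases 2 and 3), keeping only the first hit of each grid cell.

    Illustrative sketch: the resulting order visits the grid in a
    spatially spread-out, progressively denser pattern.
    """
    order, seen, i = [], set(), 1
    while len(order) < grid * grid:
        # Map the i-th Halton point from the unit square to a grid cell.
        x = int(halton(i, 2) * grid)
        y = int(halton(i, 3) * grid)
        if (x, y) not in seen:  # skip cells already scheduled
            seen.add((x, y))
            order.append((x, y))
        i += 1
    return order
```

A scheduler built on such an ordering would unmask, at each step, the next slice of `halton_token_order(...)` rather than the highest-confidence tokens, which is the spatial-spreading behavior the abstract describes.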