Image generative models have long struggled to recreate intricate geometric features, such as those of human hands and fingers, and this has remained an open problem in image generation for nearly a decade. While strides have been made by increasing model sizes and diversifying training datasets, the issue persists across all models, from denoising diffusion models to Generative Adversarial Networks (GANs), pointing to a fundamental shortcoming in the underlying architectures. In this paper, we demonstrate how this problem can be mitigated by augmenting the geometric capabilities of convolution layers: we provide them with a single input channel that encodes the relative n-dimensional Cartesian coordinate system. We show that this drastically improves the quality of images generated by diffusion models, GANs, and Variational Autoencoders (VAEs).
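To make the idea of coordinate-augmented convolution inputs concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not say how the n-dimensional coordinates are encoded into a single channel, so the sketch assumes a simpler, related scheme: it appends one coordinate channel per spatial axis, with values normalized to [-1, 1] as one possible reading of "relative". The function name `add_coord_channels` and the `(N, C, *spatial)` layout are hypothetical choices made for illustration.

```python
import numpy as np


def add_coord_channels(x):
    """Append normalized coordinate channels to a batch shaped (N, C, *spatial).

    Assumption: one channel per spatial axis, values in [-1, 1]. The source
    abstract describes a single channel; its exact encoding is not given.
    """
    spatial = x.shape[2:]
    # One 1-D ramp of evenly spaced values in [-1, 1] per spatial dimension.
    axes = [np.linspace(-1.0, 1.0, s, dtype=x.dtype) for s in spatial]
    # "ij" indexing keeps grid k varying along spatial axis k.
    grids = np.meshgrid(*axes, indexing="ij")
    coords = np.stack(grids, axis=0)  # shape: (n_dims, *spatial)
    # Repeat the same coordinate grid for every sample in the batch.
    coords = np.broadcast_to(coords, (x.shape[0],) + coords.shape)
    # Concatenate along the channel axis so a convolution sees position.
    return np.concatenate([x, coords], axis=1)


batch = np.zeros((2, 3, 4, 5), dtype=np.float32)  # 2 RGB images, 4x5 pixels
out = add_coord_channels(batch)
print(out.shape)  # (2, 5, 4, 5): 3 image channels + 2 coordinate channels
```

The output of such a function would be fed to the first convolution layer, or to every convolution layer, in place of the raw feature map. Which layers receive the coordinates is a design choice the abstract leaves open.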