The inability of generative image models to reproduce intricate geometric features, such as those of human hands and fingers, has been a persistent problem in image generation for nearly a decade. Although increasing model sizes and diversifying training datasets have brought progress, the issue remains prevalent across all models, from denoising diffusion models to Generative Adversarial Networks (GANs), pointing to a fundamental shortcoming in the underlying architectures. In this paper, we demonstrate how this problem can be mitigated by augmenting the geometric capabilities of convolution layers: we provide them with a single input channel that encodes the relative $n$-dimensional Cartesian coordinate system. We show that this drastically improves the quality of hand and face images generated by GANs and Variational AutoEncoders (VAEs).
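To make the idea of coordinate-augmented convolution inputs concrete, the following is a minimal NumPy sketch and not the method of this paper. The abstract does not specify how the $n$-dimensional coordinates are encoded into a single channel, so this sketch assumes a simpler, commonly used variant: one normalized coordinate channel per spatial axis, concatenated to the feature map. The function name `add_coord_channels` and the $[-1, 1]$ normalization are illustrative assumptions.

```python
import numpy as np


def add_coord_channels(x):
    """Append one normalized coordinate channel per spatial axis.

    x has shape (batch, channels, *spatial), with any number of spatial axes.
    ASSUMPTION: one channel per axis, values in [-1, 1]; the paper's
    single-channel encoding is not described in the abstract.
    """
    batch, _, *spatial = x.shape
    # Evenly spaced coordinates in [-1, 1] along each spatial axis.
    axes = [np.linspace(-1.0, 1.0, size, dtype=x.dtype) for size in spatial]
    # indexing="ij" yields one grid per axis, each with shape `spatial`.
    grids = np.meshgrid(*axes, indexing="ij")
    coords = np.stack(grids, axis=0)  # (n_dims, *spatial)
    # Repeat the same coordinate grids for every item in the batch.
    coords = np.broadcast_to(coords, (batch, *coords.shape))
    return np.concatenate([x, coords], axis=1)


features = np.zeros((2, 3, 4, 5), dtype=np.float32)
augmented = add_coord_channels(features)
print(augmented.shape)
```

Under these assumptions, a convolution layer that consumes the augmented tensor would need its input channel count increased by the number of spatial axes, which gives its filters direct access to position information.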