Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder

Super-resolution (SR) and image generation are important tasks in computer vision and are widely adopted in real-world applications. Most existing methods, however, generate images only at a fixed magnification scale and suffer from over-smoothing and artifacts. They also offer neither sufficient diversity in output images nor consistency of images across scales. The most relevant prior work applied Implicit Neural Representation (INR) to the denoising diffusion model to obtain continuous-resolution yet diverse and high-quality SR results. Because this model operates in image space, the higher the resolution of the generated image, the more memory and inference time it requires, and it does not maintain scale-specific consistency. We propose a novel pipeline that can super-resolve an input image or generate a novel image from random noise at arbitrary scales. The method consists of a pretrained auto-encoder, a latent diffusion model, and an implicit neural decoder, together with their learning strategies. The proposed method runs the diffusion processes in a latent space, which makes it efficient, while keeping them aligned with the output image space decoded by MLPs at arbitrary scales. More specifically, our arbitrary-scale decoder is built by connecting in series the symmetric decoder, without up-scaling, from the pretrained auto-encoder and a Local Implicit Image Function (LIIF). The latent diffusion process is learned jointly with the denoising and alignment losses. Errors in output images are backpropagated through the fixed decoder, improving the quality of output images. In extensive experiments on multiple public benchmarks covering two tasks, i.e., image super-resolution and novel image generation at arbitrary scales, the proposed method outperforms relevant methods in metrics of image quality, diversity, and scale consistency. It is also significantly better than the relevant prior art in inference speed and memory usage.
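To make the idea of decoding at arbitrary scales with MLPs more concrete, the following is a minimal sketch of LIIF-style implicit decoding in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`make_coords`, `nearest_index`, `make_mlp`, `mlp`, `decode`) are hypothetical, the MLP weights are random and untrained, the feature map is random stand-in data rather than the output of a real decoder, and a simple nearest-cell lookup is used in place of any interpolation or ensembling a full model might apply.

```python
import random


def make_coords(n):
    """Centers of n equal cells covering the interval [-1, 1]."""
    return [-1 + (2 * i + 1) / n for i in range(n)]


def nearest_index(x, n):
    """Index of the cell (out of n cells over [-1, 1]) that contains x."""
    return min(n - 1, max(0, int((x + 1) / 2 * n)))


def make_mlp(in_dim, hidden, out_dim, seed=0):
    """Random, untrained weights for a two-layer MLP (illustration only)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(out_dim)]
    return w1, w2


def mlp(params, x):
    """Linear layer, ReLU, linear layer (biases omitted for brevity)."""
    w1, w2 = params
    h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * v for w, v in zip(row, h)) for row in w2]


def decode(features, out_h, out_w, params):
    """Query an H x W x C feature map at an arbitrary out_h x out_w grid."""
    in_h, in_w = len(features), len(features[0])
    ys_in, xs_in = make_coords(in_h), make_coords(in_w)
    cell = [2 / out_h, 2 / out_w]  # size of one output pixel in [-1, 1] units
    image = []
    for y in make_coords(out_h):
        i = nearest_index(y, in_h)
        row = []
        for x in make_coords(out_w):
            j = nearest_index(x, in_w)
            # Offset of the query point from the nearest feature's center,
            # rescaled so one input cell spans one unit.
            rel = [(y - ys_in[i]) * in_h, (x - xs_in[j]) * in_w]
            row.append(mlp(params, features[i][j] + rel + cell))
        image.append(row)
    return image


# Stand-in 3 x 3 feature map with 4 channels (assumed, random data).
rng = random.Random(1)
features = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
            for _ in range(3)]
params = make_mlp(in_dim=4 + 2 + 2, hidden=8, out_dim=3)

# The same features and the same MLP are decoded at several output sizes.
for out_h, out_w in [(3, 3), (7, 5), (10, 10)]:
    img = decode(features, out_h, out_w, params)
    print(out_h, out_w, len(img), len(img[0]), len(img[0][0]))
```

The output resolution is only an argument to `decode`: the feature map and the MLP stay fixed, and non-integer scale factors such as 3 to 7 need no special handling. This is the property that lets a single decoder serve every scale.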
