This dissertation aims to advance generative modeling for computer vision by exploring novel formulations of conditional generative models and their applications to images, 3D animations, and video. Our research focuses on architectures that offer reversible transformations between noise and visual data, and on the application of encoder-decoder architectures to generative tasks and 3D content manipulation. In all instances, we incorporate conditional information to enhance the synthesis of visual data, improving both the efficiency of the generation process and the quality of the generated content. First, we introduce the use of Neural ODEs to model video dynamics within an encoder-decoder architecture, demonstrating that they can predict future video frames despite being trained solely to reconstruct current frames. Second, we propose a conditional variant of continuous normalizing flows that generates higher-resolution images from lower-resolution input, achieving comparable image quality while reducing parameter count and training time. Third, we present a pipeline that takes human images as input, automatically aligns a user-specified 3D character with the pose of the human, and facilitates pose editing based on partial inputs. Fourth, we derive the relevant mathematical details for denoising diffusion models that use non-isotropic Gaussian processes, and show comparable generation quality. Finally, we devise a novel denoising diffusion framework capable of solving all three video tasks of prediction, generation, and interpolation. We perform ablation studies and show state-of-the-art results on multiple datasets. These contributions have been published as articles at peer-reviewed venues. Overall, our research aims to make a meaningful contribution to the pursuit of more efficient and flexible generative models, with the potential to shape the future of computer vision.