Conditional Generative Modeling for Images, 3D Animations, and Video

This dissertation aims to advance generative modeling for computer vision by exploring novel formulations of conditional generative models and their applications to images, 3D animations, and video. Our research focuses on architectures that offer reversible transformations between noise and visual data, and on the application of encoder-decoder architectures to generative tasks and 3D content manipulation. In all instances, we incorporate conditional information to enhance the synthesis of visual data, improving both the efficiency of the generation process and the quality of the generated content. First, we introduce the use of Neural ODEs to model video dynamics within an encoder-decoder architecture, demonstrating their ability to predict future video frames despite being trained solely to reconstruct current frames. Second, we propose a conditional variant of continuous normalizing flows that generates higher-resolution images from lower-resolution input, achieving comparable image quality while reducing parameter count and training time. Third, we present a pipeline that takes human images as input, automatically aligns a user-specified 3D character with the pose of the human, and facilitates pose editing based on partial inputs. Fourth, we derive the relevant mathematical details for denoising diffusion models that use non-isotropic Gaussian processes, and show comparable generation quality. Finally, we devise a novel denoising diffusion framework capable of solving all three video tasks of prediction, generation, and interpolation; we perform ablation studies and show state-of-the-art results on multiple datasets. Our contributions have been published as articles at peer-reviewed venues. Overall, our research aims to make a meaningful contribution to the pursuit of more efficient and flexible generative models, with the potential to shape the future of computer vision.
