We address the problem of multi-object 3D pose control in image diffusion models. Instead of conditioning on a sequence of text tokens, we propose to use a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene. Neural Assets are obtained by pooling visual representations of objects from a reference image, such as a frame in a video, and are trained to reconstruct the respective objects in a different image, e.g., a later frame of the same video. Importantly, we encode object visuals from the reference image while conditioning on object poses from the target frame, which enables the model to learn disentangled appearance and pose features. Combining visual and 3D pose representations in a sequence-of-tokens format allows us to keep the text-to-image architecture of existing models, with Neural Assets in place of text tokens. By fine-tuning a pre-trained text-to-image diffusion model on this conditioning signal, our approach enables fine-grained 3D pose and placement control of individual objects in a scene. We further demonstrate that Neural Assets can be transferred and recomposed across different scenes. Our model achieves state-of-the-art multi-object editing results on synthetic 3D scene datasets as well as on two real-world video datasets (Objectron and Waymo Open).
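To make the conditioning scheme concrete, below is a minimal PyTorch sketch of how a Neural Asset token could be assembled: per-object appearance is pooled from a reference-frame feature map, while the pose input comes from the target frame. All names here (NeuralAssetEncoder, visual_proj, pose_proj, the 12-dimensional flattened pose) are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn
from torchvision.ops import roi_align


class NeuralAssetEncoder(nn.Module):
    """Builds one conditioning token per object: appearance pooled from a
    reference image, concatenated with the object's pose in the target frame.
    Hypothetical sketch; not the authors' code."""

    def __init__(self, visual_dim=256, pose_dim=12, token_dim=768, roi_size=4):
        super().__init__()
        self.roi_size = roi_size
        # Project pooled visual features and raw pose parameters into a shared token space.
        self.visual_proj = nn.Linear(visual_dim * roi_size * roi_size, token_dim)
        self.pose_proj = nn.Linear(pose_dim, token_dim)
        self.out_proj = nn.Linear(2 * token_dim, token_dim)

    def forward(self, ref_features, ref_boxes, target_poses):
        # ref_features: (B, C, H, W) feature map of the reference image
        # ref_boxes:    list of B tensors, each (N_i, 4) in (x1, y1, x2, y2) format
        # target_poses: (sum N_i, pose_dim) object poses taken from the *target* frame
        pooled = roi_align(ref_features, ref_boxes,
                           output_size=self.roi_size, spatial_scale=1.0)
        appearance = self.visual_proj(pooled.flatten(1))   # (N, token_dim)
        pose = self.pose_proj(target_poses)                # (N, token_dim)
        # One token per object: appearance comes from the reference frame,
        # pose from the target frame, encouraging disentangled features.
        return self.out_proj(torch.cat([appearance, pose], dim=-1))


# Usage: the resulting (N, token_dim) sequence would stand in for the text-token
# sequence fed to the diffusion model's cross-attention layers.
encoder = NeuralAssetEncoder()
feats = torch.randn(1, 256, 32, 32)
boxes = [torch.tensor([[2.0, 2.0, 12.0, 12.0], [5.0, 8.0, 20.0, 24.0]])]
poses = torch.randn(2, 12)  # e.g., a flattened 3x4 object-to-camera transform
tokens = encoder(feats, boxes, poses)
print(tokens.shape)  # torch.Size([2, 768])

Because the output is an ordinary token sequence, a pre-trained text-to-image diffusion model can be fine-tuned on it with its architecture unchanged, simply swapping Neural Asset tokens for the text encoder's output.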