Model customization introduces new concepts to existing text-to-image models, enabling the generation of the new concept in novel contexts. However, such methods lack accurate camera view control with respect to the object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task -- enabling explicit control of the camera viewpoint for model customization. This allows us to modify object properties across various background scenes via text prompts, all while incorporating the target camera pose as additional control. This new task presents significant challenges in merging a 3D representation from the multi-view images of the new concept with a general, 2D text-to-image model. To bridge this gap, we propose to condition the 2D diffusion process on rendered, view-dependent features of the new object. During training, we jointly adapt the 2D diffusion modules and the 3D feature predictions to reconstruct the object's appearance and geometry while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model personalization baselines in preserving the custom object's identity while following the input text prompt and the object's camera pose.
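To make the conditioning idea concrete, below is a minimal PyTorch sketch of injecting rendered, view-dependent features into an intermediate 2D diffusion feature map. This is an illustration, not the paper's implementation: the module name `ViewConditionedBlock`, the channel widths, and the zero-initialized additive adapter are all assumptions chosen so the pretrained 2D model's behavior is unchanged at the start of training.

```python
# A minimal sketch (not the authors' code) of conditioning a 2D diffusion
# feature map on features rendered from a 3D representation at the target
# camera pose. All names and shapes here are hypothetical.
import torch
import torch.nn as nn


class ViewConditionedBlock(nn.Module):
    """Adds projected rendered-3D features to a 2D diffusion activation.

    The projection is zero-initialized, so at step 0 the block is an
    identity on the diffusion features (a common adapter-style trick).
    """

    def __init__(self, diff_channels: int, render_channels: int):
        super().__init__()
        # Project rendered features to the diffusion feature width.
        self.proj = nn.Conv2d(render_channels, diff_channels, kernel_size=1)
        # Zero init: the pretrained 2D pathway is untouched initially.
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, diff_feat: torch.Tensor, rendered_feat: torch.Tensor) -> torch.Tensor:
        # diff_feat:     (B, C, H, W) intermediate UNet activation
        # rendered_feat: (B, C_r, H, W) view-dependent features rendered
        #                at the target camera pose
        return diff_feat + self.proj(rendered_feat)


# Toy usage with made-up shapes: a 320-channel UNet activation and
# 16-channel rendered features at the same spatial resolution.
block = ViewConditionedBlock(diff_channels=320, render_channels=16)
unet_act = torch.randn(1, 320, 32, 32)
rendered = torch.randn(1, 16, 32, 32)
out = block(unet_act, rendered)
print(out.shape)  # torch.Size([1, 320, 32, 32])
```

During joint training as described above, gradients would flow both into such adapter modules (the 2D side) and back into whatever network produces `rendered_feat` (the 3D side), which is what couples appearance reconstruction with pose control.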