This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods struggle to fully meet the task's challenging objectives: (i) seamlessly composing the object into the scene with photorealistic pose and lighting, and (ii) preserving the object's identity. We hypothesize that achieving these goals requires large-scale supervision, but manually collecting sufficient data is prohibitively expensive. The key observation of this paper is that many mass-produced objects recur across multiple images in large unlabeled datasets, under different scenes, poses, and lighting conditions. We use this observation to create massive supervision by retrieving sets of diverse views of the same object. This powerful paired dataset enables us to train a straightforward text-to-image diffusion architecture that maps the object and scene descriptions to the composited image. We compare our method, ObjectMate, with state-of-the-art methods for object insertion and subject-driven generation, using a single or multiple references. Empirically, ObjectMate achieves superior identity preservation and more photorealistic composition. Unlike many other multi-reference methods, ObjectMate does not require slow test-time tuning.