This work presents AnyDoor, a diffusion-based image generator that can teleport target objects to new scenes at user-specified locations in a harmonious way. Instead of tuning parameters for each object, our model is trained only once and generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of the given object. To this end, we complement the commonly used identity feature with detail features, which are carefully designed to maintain texture details yet allow versatile local variations (e.g., lighting, orientation, and posture), helping the object blend naturally with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms of a single object along the time axis, leading to stronger model generalizability and robustness. Extensive experiments demonstrate the superiority of our approach over existing alternatives as well as its great potential in real-world applications, such as virtual try-on and object moving. The project page is https://damo-vilab.github.io/AnyDoor-Page/.