Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework that unleashes these priors without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or to brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces a manifold-steered anchor loss that leverages pretrained customization adapters (e.g., IP-Adapter) to guide latents toward a faithful subject representation while preserving background integrity. We further propose degradation-suppression guidance and adaptive background blending to suppress low-quality outputs and eliminate visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench demonstrate state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly released upon publication.
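The abstract only names the components of SHINE without spelling out their mechanics. As a rough illustration of what "training-free guidance of latents with an anchor loss" typically means operationally, the following is a minimal PyTorch sketch of a single loss-guided sampling step. The `denoiser` and `anchor_loss_fn` callables, the `subject_embed` argument (an IP-Adapter-style image embedding), and all parameter names are assumptions for illustration; this is the generic loss-guided diffusion sampling pattern, not the authors' implementation.

```python
import torch

def anchor_guided_step(latents, denoiser, subject_embed, anchor_loss_fn,
                       t, guidance_scale=1.0, lr=0.05):
    """One illustrative training-free guidance step: nudge the current
    latent toward a subject-faithful anchor (hypothetical API)."""
    latents = latents.detach().requires_grad_(True)
    # Predict the denoised sample at timestep t, conditioned on the
    # subject embedding (e.g., an IP-Adapter image embedding).
    x0_pred = denoiser(latents, t, cond=subject_embed)
    # The anchor loss scores how faithfully the prediction renders the
    # subject; lower is better.
    loss = anchor_loss_fn(x0_pred, subject_embed)
    # Backpropagate to the latents only; no model weights are updated,
    # which is what makes the procedure training-free.
    grad, = torch.autograd.grad(loss, latents)
    return (latents - lr * guidance_scale * grad).detach()
```

In a full sampler, a step like this would be interleaved with the scheduler's ordinary denoising updates, with background preservation and blending handled separately.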