IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout

Recent diffusion models have advanced image editing by improving fidelity and controllability across creative and personalized applications. However, multi-object scenes remain challenging, because reliable control over object categories, counts, and spatial layout is difficult to achieve. To address this, we first study quantity- and layout-consistent image editing, abbreviated as QL-Edit, which targets control of object quantity and spatial layout in multi-object scenes. We then present IMAGHarmony, a straightforward framework featuring a plug-and-play harmony-aware (HA) module that fuses perception semantics while modeling object counts and locations, yielding accurate edits and strong structural consistency. We further observe that diffusion models are sensitive to the choice of initial noise and tend to prefer certain noise patterns. Based on this finding, we propose a preference-guided noise selection (PNS) strategy that selects semantically aligned initial noise through vision-language matching, further improving generation stability and layout consistency in multi-object editing. To support evaluation, we develop HarmonyBench, a comprehensive benchmark that covers a diverse range of quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony outperforms prior methods in both structural alignment and semantic accuracy, using only 200 training images and 10.6M trainable parameters. Code, models, and data are available at https://github.com/muzishen/IMAGHarmony.
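
The abstract describes PNS only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation: draw several candidate initial noises (represented here by seeds), generate a preview from each, score each preview for alignment with the prompt, and keep the best-scoring seed. The names `select_preferred_seed`, `generate`, and `score` are hypothetical; in practice `generate` might be a short diffusion run and `score` an image-text similarity from a CLIP-style model, both of which are assumptions not stated in the abstract. Toy stand-ins are used so the sketch runs without any model.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")


def select_preferred_seed(
    seeds: Sequence[int],
    generate: Callable[[int], Sample],
    score: Callable[[Sample], float],
) -> Tuple[int, float]:
    """Return (seed, score) for the candidate whose sample scores highest.

    `generate` and `score` are hypothetical callables supplied by the caller;
    nothing here reflects the actual IMAGHarmony code.
    """
    if not seeds:
        raise ValueError("at least one candidate seed is required")
    best_seed, best_score = seeds[0], float("-inf")
    for seed in seeds:
        sample = generate(seed)   # assumed: preview produced from this seed's noise
        value = score(sample)     # assumed: vision-language alignment score
        if value > best_score:    # strict '>' keeps the earliest seed on ties
            best_seed, best_score = seed, value
    return best_seed, best_score


# Toy stand-ins (not a diffusion model): a "sample" is a list of Gaussian draws.
def toy_generate(seed: int) -> List[float]:
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(8)]


def toy_score(sample: List[float]) -> float:
    # Pretend alignment is higher when the sample mean is close to 0.5.
    mean = sum(sample) / len(sample)
    return -abs(mean - 0.5)


best_seed, best_score = select_preferred_seed(range(16), toy_generate, toy_score)
```

The selected seed would then be reused for the full-quality edit. The cost grows linearly with the number of candidates, so a real pipeline would likely use a cheap preview for scoring.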
