Subject-driven image generation has advanced from single-subject to multi-subject composition, but it has neglected distinction: the ability to identify and generate the correct subject when the inputs contain multiple candidates. This limitation restricts its effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models on composition and distinction tasks across two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.