Multi-domain image-to-image translation requires grounding semantic differences expressed in natural language prompts into corresponding visual transformations, while preserving unrelated structural and semantic content. Existing methods struggle to maintain structural integrity and to provide fine-grained, attribute-specific control, especially when multiple domains are involved. We propose LACE (Language-grounded Attribute Controllable Translation), built on two components: (1) a GLIP-Adapter that fuses global semantics with local structural features to preserve consistency, and (2) a Multi-Domain Control Guidance mechanism that explicitly grounds the semantic delta between source and target prompts into per-attribute translation vectors, aligning linguistic semantics with domain-level visual changes. Together, these modules enable compositional multi-domain control with independent strength modulation for each attribute. Experiments on CelebA-Dialog and BDD100K demonstrate that LACE achieves high visual fidelity, structural preservation, and interpretable domain-specific control, surpassing prior baselines. This positions LACE as a cross-modal content generation framework bridging language semantics and controllable visual translation.
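The core idea of grounding a prompt-pair semantic delta into per-attribute translation vectors with independent strengths can be sketched as follows. This is a minimal toy illustration, not LACE's actual implementation: the encoder, the latent space, and the function names (`encode_prompt`, `apply_deltas`) are all assumptions made for exposition.

```python
import numpy as np

def encode_prompt(prompt: str, dim: int = 8) -> np.ndarray:
    """Stand-in text encoder: a deterministic pseudo-embedding per prompt.
    (A real system would use a learned language encoder here.)"""
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def attribute_delta(src_prompt: str, tgt_prompt: str) -> np.ndarray:
    """Ground the semantic delta between source and target prompts
    as a unit translation vector for one attribute."""
    d = encode_prompt(tgt_prompt) - encode_prompt(src_prompt)
    return d / np.linalg.norm(d)

def apply_deltas(z: np.ndarray, edits: dict) -> np.ndarray:
    """Compose multiple attribute edits on a latent code, each scaled
    by its own independent strength (compositional multi-domain control)."""
    for (src_prompt, tgt_prompt), strength in edits.items():
        z = z + strength * attribute_delta(src_prompt, tgt_prompt)
    return z

# Toy latent code of a source image, plus two independently weighted edits.
rng = np.random.default_rng(0)
z = rng.standard_normal(8)
z_edited = apply_deltas(z, {
    ("young face", "old face"): 0.8,  # strong aging edit
    ("no smile", "smile"): 0.3,       # mild smile edit
})
print(z_edited.shape)  # (8,)
```

Because each attribute contributes an additive vector with its own scalar strength, edits can be composed or attenuated independently, which is the interpretability property the abstract highlights.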