This paper studies reference-free style-conditioned character generation in text-to-image diffusion models, where high-quality synthesis requires both stable character structure and consistent, fine-grained style expression across diverse prompts. Existing approaches either rely on text-only prompting, which is often under-specified for visual style and tends to produce noticeable style drift and geometric inconsistency, or introduce reference-based adapters that depend on external images at inference time, increasing architectural complexity and limiting deployment flexibility. We propose PokeFusion Attention, a lightweight decoder-level cross-attention mechanism that fuses textual semantics with learned style embeddings directly inside the diffusion decoder. By decoupling text and style conditioning at the attention level, our method enables effective reference-free stylized generation while keeping the pretrained diffusion backbone fully frozen. PokeFusion Attention trains only the decoder cross-attention layers together with a compact style projection module, yielding a parameter-efficient, plug-and-play control component that integrates easily into existing diffusion pipelines and transfers across different backbones. Experiments on a stylized (Pokemon-style) character generation benchmark demonstrate that our method consistently improves style fidelity, semantic alignment, and character shape consistency over representative adapter-based baselines, while maintaining low parameter overhead and inference-time simplicity.
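The fusion mechanism described above can be sketched as follows. This is a minimal, illustrative NumPy implementation of a decoder cross-attention block that appends a projected style embedding as an extra key/value token alongside the text tokens; all class names, dimensions, and weight layouts are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


class PokeFusionCrossAttention:
    """Hypothetical sketch: decoder cross-attention fusing text tokens
    with a learned style embedding (single head, no batching)."""

    def __init__(self, d_model, d_text, d_style, seed=0):
        rng = np.random.default_rng(seed)
        # In the described setup, only these cross-attention projections and
        # the compact style projection would be trainable; the backbone stays frozen.
        self.W_q = rng.standard_normal((d_model, d_model)) * 0.02
        self.W_k = rng.standard_normal((d_text, d_model)) * 0.02
        self.W_v = rng.standard_normal((d_text, d_model)) * 0.02
        # Compact style projection: maps the style embedding into the
        # text-conditioning space so it can be attended to like a token.
        self.W_style = rng.standard_normal((d_style, d_text)) * 0.02

    def __call__(self, latent_tokens, text_tokens, style_embedding):
        # Decoupled conditioning: the style embedding becomes one extra
        # key/value token appended after the text tokens.
        style_token = style_embedding @ self.W_style            # (1, d_text)
        cond = np.concatenate([text_tokens, style_token], axis=0)
        q = latent_tokens @ self.W_q                            # (N, d_model)
        k = cond @ self.W_k                                     # (T+1, d_model)
        v = cond @ self.W_v                                     # (T+1, d_model)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))          # (N, T+1)
        return attn @ v                                         # (N, d_model)


# Toy usage: 4 latent tokens attending over 7 text tokens plus 1 style token.
block = PokeFusionCrossAttention(d_model=8, d_text=8, d_style=4)
out = block(np.zeros((4, 8)), np.zeros((7, 8)), np.zeros((1, 4)))
print(out.shape)  # (4, 8)
```

Because the style signal enters as a separate token rather than being mixed into the text prompt, the text and style conditioning pathways remain decoupled at the attention level, which is what allows reference-free stylization without modifying the frozen backbone.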