Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across multiple subjects remains challenging, because existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline assembles captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that strengthens the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention combine position-aware embeddings with refined attention dynamics to encode explicit subject-attribute dependencies, enforcing intra-group cohesion and increasing the separation between distinct subject groups. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.
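The abstract describes Relational Self-Attention and Relational Cross-Attention only at a high level. One generic way to encourage intra-group cohesion and inter-group separation in attention is an additive bias on the attention logits that penalizes token pairs from different subject groups. The sketch below illustrates that general idea and is not the LumosX implementation: the function names, the `group_ids` encoding (one integer per token, shared by a face and its attributes), and the `cross_group_penalty` value are all assumptions made for illustration.

```python
import numpy as np


def group_attention_bias(group_ids, cross_group_penalty=4.0):
    """Additive logit bias: 0 inside a group, -penalty across groups.

    `group_ids` and `cross_group_penalty` are hypothetical names, not
    taken from the paper.
    """
    ids = np.asarray(group_ids)
    same_group = ids[:, None] == ids[None, :]
    return np.where(same_group, 0.0, -float(cross_group_penalty))


def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy setup: six tokens, two subject groups of three tokens each.
rng = np.random.default_rng(0)
group_ids = [0, 0, 0, 1, 1, 1]
tokens = rng.normal(size=(6, 8))

bias = group_attention_bias(group_ids)
_, plain = biased_attention(tokens, tokens, tokens, np.zeros((6, 6)))
_, grouped = biased_attention(tokens, tokens, tokens, bias)

# Attention mass each token places on members of its own group.
same = bias == 0.0
in_group_mass_plain = (plain * same).sum(axis=-1)
in_group_mass_grouped = (grouped * same).sum(axis=-1)
```

In this toy setup, the bias leaves within-group logits unchanged and lowers cross-group logits, so each token's attention mass on its own group increases relative to unbiased attention. A soft penalty rather than a hard mask is used here so that groups can still exchange some information; whether LumosX makes the same choice is not stated in the abstract.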