Recent advances in text-to-video (T2V) generation with diffusion models have garnered significant attention. However, these models typically perform well only in scenes containing a single object and a single motion; in compositional scenarios with multiple objects and distinct motions, they struggle to accurately reflect the semantic content of text prompts. To address these challenges, we propose \textbf{StarVid}, a plug-and-play, training-free method that improves semantic alignment among multiple subjects, their motions, and the text prompt in T2V models. StarVid first leverages the spatial reasoning capabilities of large language models (LLMs) to perform two-stage motion trajectory planning from the text prompt. These trajectories serve as spatial priors that guide a spatial-aware loss to refocus cross-attention (CA) maps onto distinct regions. Furthermore, we propose a syntax-guided contrastive constraint that strengthens the correlation between the CA maps of verbs and those of their corresponding nouns, enhancing motion-subject binding. Both qualitative and quantitative evaluations demonstrate that the proposed framework significantly outperforms baseline methods, delivering videos of higher quality with improved semantic consistency.
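To make the spatial-aware guidance concrete, the following is a minimal sketch of how a loss of this kind can be formulated; it is an illustrative form under assumed inputs (a per-token cross-attention map and a binary region mask derived from an LLM-planned trajectory box), not the paper's exact objective.

```python
import torch

def spatial_aware_loss(attn_map: torch.Tensor, region_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative spatial-aware loss (hypothetical form).

    Encourages the cross-attention map of one subject token to concentrate
    inside the region given by a planned trajectory box at the current frame.

    attn_map:    (H, W) non-negative cross-attention map for one subject token.
    region_mask: (H, W) binary mask, 1 inside the planned box, 0 outside.
    """
    attn = attn_map / (attn_map.sum() + 1e-8)   # normalize to a spatial distribution
    inside = (attn * region_mask).sum()          # attention mass falling inside the box
    return 1.0 - inside                          # minimized when all mass lies inside
```

In a training-free setting, a loss like this would typically be evaluated per subject token and per frame during sampling, with its gradient used to update the latents so that each subject's attention is steered toward its planned region.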