StarVid：通过空间与句法引导的注意力重聚焦增强视频扩散模型中的语义对齐 (StarVid: Enhancing Semantic Alignment in Video Diffusion Models via Spatial and SynTactic Guided Attention Refocusing)

Recent advances in text-to-video (T2V) generation with diffusion models have garnered significant attention. However, they typically perform well in scenes with a single object and motion, struggling in compositional scenarios with multiple objects and distinct motions to accurately reflect the semantic content of text prompts. To address these challenges, we propose \textbf{StarVid}, a plug-and-play, training-free method that improves semantic alignment between multiple subjects, their motions, and text prompts in T2V models. StarVid first leverages the spatial reasoning capabilities of large language models (LLMs) for two-stage motion trajectory planning based on text prompts. Such trajectories serve as spatial priors, guiding a spatial-aware loss to refocus cross-attention (CA) maps into distinctive regions. Furthermore, we propose a syntax-guided contrastive constraint to strengthen the correlation between the CA maps of verbs and their corresponding nouns, enhancing motion-subject binding. Both qualitative and quantitative evaluations demonstrate that the proposed framework significantly outperforms baseline methods, delivering videos of higher quality with improved semantic consistency.

翻译：基于扩散模型的文本到视频（T2V）生成技术的最新进展已引起广泛关注。然而，现有方法通常在包含单一物体和运动的场景中表现良好，而在涉及多个物体及不同运动的复合场景中，往往难以准确反映文本提示的语义内容。为应对这些挑战，我们提出了 \textbf{StarVid}，一种即插即用、无需训练的方法，旨在提升T2V模型中多个主体、其运动与文本提示之间的语义对齐。StarVid首先利用大语言模型（LLMs）的空间推理能力，基于文本提示进行两阶段的运动轨迹规划。此类轨迹作为空间先验，引导一种空间感知损失，将交叉注意力（CA）图重聚焦至不同的区域。此外，我们提出了一种句法引导的对比约束，以加强动词与其对应名词的CA图之间的关联，从而增强运动与主体的绑定。定性与定量评估均表明，所提出的框架显著优于基线方法，能够生成具有更高语义一致性的高质量视频。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

专知会员服务

13+阅读 · 2022年3月12日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日