Despite recent advances in Text-to-Video (T2V) synthesis, generating high-fidelity and dynamic motion remains a significant challenge. Existing methods primarily rely on Classifier-Free Guidance (CFG), often with explicit negative prompts (e.g., "static", "blurry"), to suppress undesired artifacts. However, such explicit negations frequently introduce unintended semantic bias and distort object integrity, a phenomenon we define as Content-Motion Drift. To address this, we propose MotionCFG, a framework that enhances motion dynamics by contrasting a target concept with its noise-perturbed counterparts. Specifically, by injecting Gaussian noise into the concept embeddings, MotionCFG creates localized negative anchors that encapsulate a broad complementary space of sub-optimal motion variations. Unlike explicit negations, this approach facilitates implicit hard negative mining without shifting the global semantic identity, allowing for a focused refinement of temporal details. Combined with a piecewise guidance schedule that confines intervention to the early denoising steps, MotionCFG consistently improves motion dynamics across state-of-the-art T2V frameworks with negligible computational overhead and minimal compromise in visual quality. Additionally, we demonstrate that this noise-induced contrastive mechanism is effective not only for sharpening motion trajectories but also for steering complex, non-linear concepts such as precise object numerosity, which are typically difficult to modulate via standard text-based guidance.
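The mechanism described above can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so everything here is an assumption for illustration: the function names, the CFG-style extrapolation `positive + w * (positive - negative)`, the step-function schedule, and the default values of `scale`, `sigma`, and `early_fraction` are all hypothetical, and the sketch does not show how this term would be combined with ordinary unconditional CFG.

```python
import random

def perturb_embedding(embedding, sigma, rng):
    """Add i.i.d. Gaussian noise (std `sigma`) to a concept embedding.

    The result plays the role of a localized negative anchor (assumed form).
    """
    return [x + rng.gauss(0.0, sigma) for x in embedding]

def piecewise_scale(step, total_steps, scale, early_fraction):
    """Hypothetical piecewise schedule: `scale` during the early steps, 0 afterwards."""
    return scale if step < early_fraction * total_steps else 0.0

def guided_prediction(denoiser, latent, embedding, step, total_steps,
                      scale=4.0, sigma=0.1, early_fraction=0.3, rng=None):
    """Contrast the clean-embedding prediction with a noise-perturbed one.

    `denoiser(latent, embedding)` is a stand-in for a T2V model's noise
    prediction; it is not a real library call.
    """
    rng = rng or random.Random(0)
    positive = denoiser(latent, embedding)
    w = piecewise_scale(step, total_steps, scale, early_fraction)
    if w == 0.0:
        # Outside the early window: no intervention and no extra forward pass.
        return positive
    negative = denoiser(latent, perturb_embedding(embedding, sigma, rng))
    # Assumed CFG-style extrapolation away from the perturbed (negative) anchor.
    return [p + w * (p - n) for p, n in zip(positive, negative)]
```

In this sketch the extra forward pass is only paid during the early window, which is one plausible reading of how restricting intervention to early denoising steps keeps the overhead small; the actual cost profile depends on details the abstract does not state.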