State-of-the-art text-to-video models often look realistic frame-by-frame yet fail on simple interactions: motion starts before contact, actions are not realized, objects drift after placement, and support relations break. We argue this stems from frame-first denoising, which updates latent state everywhere at every step without an explicit notion of when and where an interaction is active. We introduce Event-Driven Video Generation (EVD), a minimal DiT-compatible framework that makes sampling event-grounded: a lightweight event head predicts token-aligned event activity, event-grounded losses couple activity to state change during training, and event-gated sampling (with hysteresis and early-step scheduling) suppresses spurious updates while concentrating updates during interactions. On EVD-Bench, EVD consistently improves human preference and VBench dynamics, substantially reducing failure modes in state persistence, spatial accuracy, support relations, and contact stability without sacrificing appearance. These results indicate that explicit event grounding is a practical abstraction for reducing interaction hallucinations in video generation.
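To make the event-gated sampling idea concrete, here is a minimal NumPy sketch of a per-token gate with hysteresis and a simple early-step schedule. It is an illustration under stated assumptions, not the authors' implementation: the function names (`hysteresis_gate`, `gate_strength`, `gated_update`), the thresholds, the `floor` weight, and the reading of "early-step scheduling" as "gate only during the first fraction of steps" are all hypothetical choices, since the abstract does not specify them.

```python
import numpy as np


def hysteresis_gate(activity, prev_gate, on_thresh=0.6, off_thresh=0.4):
    """Per-token binary gate with hysteresis (thresholds are assumed values).

    A token switches on when its activity reaches `on_thresh` and stays on
    until its activity falls below `off_thresh`, which avoids flicker when
    the predicted activity hovers around a single threshold.
    """
    turn_on = activity >= on_thresh
    stay_on = prev_gate & (activity >= off_thresh)
    return turn_on | stay_on


def gate_strength(step, num_steps, early_frac=0.5):
    """Assumed schedule: gate fully during the first `early_frac` of steps,
    then disable gating so later steps update every token as usual."""
    return 1.0 if step < early_frac * num_steps else 0.0


def gated_update(latent, proposed, gate, strength, floor=0.1):
    """Blend the proposed denoising update into the latent state.

    latent, proposed: arrays of shape (tokens, channels).
    gate: boolean array of shape (tokens,).
    Active tokens take the full update. Inactive tokens take a reduced
    update whose weight moves from 1.0 (strength 0) to `floor` (strength 1).
    """
    off_weight = 1.0 - strength * (1.0 - floor)
    weight = np.where(gate, 1.0, off_weight)
    return latent + weight[:, None] * (proposed - latent)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_steps, tokens, channels = 4, 3, 2
    latent = np.zeros((tokens, channels))
    gate = np.zeros(tokens, dtype=bool)
    for step in range(num_steps):
        # Stand-ins for the event head output and the denoiser's proposal.
        activity = rng.uniform(size=tokens)
        proposed = latent + rng.normal(size=(tokens, channels))
        gate = hysteresis_gate(activity, gate)
        strength = gate_strength(step, num_steps)
        latent = gated_update(latent, proposed, gate, strength)
        print(step, gate.tolist(), strength)
```

The two thresholds are what make this hysteresis rather than a plain cutoff: an activity value between them keeps whatever state the token already had. A soft floor is used instead of a hard zero so that inactive tokens can still be refined slightly; whether EVD does this is not stated in the abstract.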