TIE: Time Interval Encoding for Video Generation over Events

Director-style prompting, robotic action prediction, and interactive video agents all demand temporal grounding over concurrent events: 68% of general clips and over 99% of robotics/gameplay clips contain overlapping events, yet existing multi-event generators rest on a single-active-prompt assumption. Moreover, modern video generators, such as Diffusion Transformers (DiT), represent time as discrete points through point-wise positional encodings. This formulation creates a fundamental dimension mismatch: temporally extended intervals and overlapping events cannot be represented within the attention mechanism. In this paper, we propose Time Interval Encoding (TIE), a principled, plug-and-play, interval-aware generalization of rotary embeddings that elevates time intervals to first-class primitives inside DiT cross-attention. Rather than introducing another heuristic interval embedding, we show that, within RoPE-compatible bilinear attention, TIE is characterized by two basic principles: Temporal Integrability, which requires an event to aggregate positional evidence over its full duration, and Duration Invariance, which removes the trivial bias toward longer intervals. Under a uniform kernel, this characterization yields an efficient closed-form sinc-based solution that preserves the standard attention interface and naturally attenuates boundary noise through interval integration. Empirically, TIE preserves the visual quality of the base DiT model while substantially improving temporal controllability. On the OmniEvents dataset, it improves the human-verified Temporal Constraint Satisfaction Rate from 77.34% to 96.03% and reduces temporal boundary error from 0.261s to 0.073s, while also improving trajectory-level temporal alignment metrics. The code and dataset are available at https://github.com/MatrixTeam-AI/TIE.
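To make the sinc-based closed form concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the usual RoPE phasor exp(i*w*t) per frequency w, and reads Temporal Integrability as integrating that phasor over the event interval [start, end] and Duration Invariance as dividing by the interval length. Under those assumptions the uniform-kernel average equals exp(i*w*c) * sin(w*d/2) / (w*d/2), where c is the interval center and d its duration. The function names (`rope_frequencies`, `interval_phasor`, `interval_phasor_numeric`) and the base of 10000 are hypothetical choices made for illustration.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    # Common RoPE frequency ladder (assumed here): one angular frequency
    # per 2-D rotary pair, geometrically spaced from 1 down to ~1/base.
    return base ** (-np.arange(0, dim, 2) / dim)

def interval_phasor(start, end, freqs):
    """Closed-form, duration-normalized integral of exp(i*w*t) over [start, end]."""
    center = 0.5 * (start + end)
    half_duration = 0.5 * (end - start)
    # np.sinc(x) is sin(pi*x)/(pi*x), so divide the argument by pi
    # to obtain the unnormalized sin(z)/z with z = w * half_duration.
    attenuation = np.sinc(freqs * half_duration / np.pi)
    return attenuation * np.exp(1j * freqs * center)

def interval_phasor_numeric(start, end, freqs, n=20000):
    """Midpoint-rule average of exp(i*w*t) over [start, end], for checking only."""
    t = start + (np.arange(n) + 0.5) * (end - start) / n
    return np.exp(1j * freqs[:, None] * t[None, :]).mean(axis=1)

if __name__ == "__main__":
    freqs = rope_frequencies(8)
    closed = interval_phasor(1.0, 3.5, freqs)
    numeric = interval_phasor_numeric(1.0, 3.5, freqs)
    print("max abs difference:", np.max(np.abs(closed - numeric)))
```

Two properties follow directly from this form. A zero-length interval gives an attenuation of 1, so the encoding reduces to the ordinary point-wise rotary phasor. For longer intervals the sinc factor shrinks the high-frequency components, which is consistent with the abstract's remark that interval integration attenuates boundary noise.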
