Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., they reuse cached features as static snapshots whenever global drift is small. In dynamic scenes, this often leads to ghosting artifacts, blur, and motion inconsistencies. We propose \textbf{WorldCache}, a Perception-Constrained Dynamical Caching framework that improves both when and how features are reused. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Together, these components enable adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves a \textbf{2.3$\times$} inference speedup while preserving \textbf{99.4\%} of baseline quality, substantially outperforming prior training-free caching approaches. Our code is available at \href{https://umair1221.github.io/World-Cache/}{World-Cache}.
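To make the Zero-Order Hold baseline concrete, the following is a minimal sketch of threshold-based feature caching with an optional per-element weighting of the drift. It is not the WorldCache implementation: the names `relative_drift` and `ZeroOrderHoldCache`, the relative-L1 drift metric, the fixed threshold, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
from typing import Callable, List, Optional

Vector = List[float]


def relative_drift(current: Vector, cached: Vector,
                   weights: Optional[Vector] = None) -> float:
    """Weighted relative L1 change between two feature vectors.

    With weights=None every element counts equally (a global drift).
    Non-uniform weights are a hypothetical stand-in for a saliency map.
    """
    if weights is None:
        weights = [1.0] * len(current)
    num = sum(w * abs(c - p) for w, c, p in zip(weights, current, cached))
    den = sum(w * abs(p) for w, p in zip(weights, cached)) + 1e-8
    return num / den


class ZeroOrderHoldCache:
    """Wraps an expensive block and reuses its last output while drift is small.

    Illustrative assumption: drift is measured against the input seen at the
    most recent recomputation, and the threshold is a fixed constant.
    """

    def __init__(self, block: Callable[[Vector], Vector], threshold: float):
        self.block = block
        self.threshold = threshold
        self.cached_input: Optional[Vector] = None
        self.cached_output: Optional[Vector] = None
        self.recomputes = 0  # number of times the block was actually run

    def __call__(self, x: Vector, weights: Optional[Vector] = None) -> Vector:
        if self.cached_input is not None:
            drift = relative_drift(x, self.cached_input, weights)
            if drift < self.threshold:
                # Zero-Order Hold: return the stale output as a static snapshot.
                return self.cached_output
        # Drift too large (or nothing cached yet): run the block and refresh.
        self.cached_input = list(x)
        self.cached_output = self.block(x)
        self.recomputes += 1
        return self.cached_output
```

In this sketch, a small change concentrated in one element can fall below a uniform threshold and trigger reuse, while the same change exceeds the threshold once that element is weighted more heavily. This is the intuition behind weighting drift by saliency, although the weighting, thresholds, and reuse rule used by WorldCache may differ.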