Flow Matching has enabled robust text-to-video generation via latent ODE sampling. However, errors from velocity approximation and numerical discretization inevitably accumulate, causing sampling trajectories to drift, so generated videos often suffer from severe spatiotemporal inconsistencies. Directly correcting these drifted, noisy latents is challenging for two reasons: (i) timestep-dependent noise obscures reliable structural cues; and (ii) spatial interventions risk disrupting intricate local geometry while incurring heavy computational costs. To address these challenges, we propose Spectral Lookahead Rectification (SpecLoR), a plug-and-play inference method that bypasses the noise via lookahead prediction and circumvents spatiotemporal entanglement by shifting corrections to the frequency domain, where universal statistical priors of natural videos are readily available. First, during early sampling stages, SpecLoR looks ahead to estimate the clean latent $z_{t,0}$ and computes its 3D spatiotemporal spectrum. Next, it rectifies the amplitude spectrum to match the prior while leaving the phase intact. Finally, the corrected state is re-noised to resume ODE integration. Experiments on Wan2.2 demonstrate that SpecLoR significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks with minimal computational overhead (4 additional NFEs).
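The three steps described above (lookahead, amplitude rectification with the phase kept, re-noising) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: the linear flow-matching path $z_t = (1-t)\,z_0 + t\,\epsilon$ with velocity $v = \epsilon - z_0$, a single-step lookahead, re-noising with the noise implied by the same velocity, and a hypothetical `target_amp` array standing in for the spectral prior. The function names and the `strength` parameter are likewise illustrative.

```python
import numpy as np

# Axes of the 3D spatiotemporal FFT: (frames, height, width).
FFT_AXES = (-3, -2, -1)


def lookahead_clean_latent(z_t, v_t, t):
    """One-step estimate of the clean latent.

    Assumes z_t = (1 - t) * z_0 + t * eps and v = eps - z_0,
    which gives z_0 = z_t - t * v.
    """
    return z_t - t * v_t


def implied_noise(z_t, v_t, t):
    """Noise estimate under the same assumed path: eps = z_t + (1 - t) * v."""
    return z_t + (1.0 - t) * v_t


def rectify_amplitude(z0_hat, target_amp, strength=1.0):
    """Pull the amplitude spectrum toward target_amp and keep the phase.

    target_amp is a hypothetical prior with the same shape as z0_hat.
    strength in [0, 1] blends the original and target amplitudes.
    """
    spec = np.fft.fftn(z0_hat, axes=FFT_AXES)
    amp = np.abs(spec)
    phase = np.angle(spec)
    new_amp = (1.0 - strength) * amp + strength * target_amp
    rectified = new_amp * np.exp(1j * phase)
    # The latent is real-valued, so drop the residual imaginary part.
    return np.fft.ifftn(rectified, axes=FFT_AXES).real


def renoise(z0, eps, t):
    """Map a clean latent back to time t on the assumed linear path."""
    return (1.0 - t) * z0 + t * eps


def rectify_step(z_t, v_t, t, target_amp, strength=1.0):
    """Lookahead -> spectral rectification -> re-noise, for one timestep."""
    z0_hat = lookahead_clean_latent(z_t, v_t, t)
    eps_hat = implied_noise(z_t, v_t, t)
    z0_fixed = rectify_amplitude(z0_hat, target_amp, strength)
    return renoise(z0_fixed, eps_hat, t)
```

In a sampler, `rectify_step` would be called on the current latent and the model's predicted velocity at a few early timesteps, and its output would replace the latent before ODE integration continues. With `strength=0` the step returns the input latent unchanged, which is a convenient sanity check for the assumed path convention.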