Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this with external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a video generator pre-trained on large-scale datasets as its own refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, preventing artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70\% human preference against both the default sampler and a guidance-based sampler.
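The inner-loop refinement described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for a pre-trained video denoiser, and the noise level, iteration count, and consistency threshold are assumed values. The key ideas it demonstrates are (i) re-noising and re-denoising the current sample several times, and (ii) updating only regions where the repeated estimates agree, as a proxy for the uncertainty-aware, self-consistency-based selection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": 2 frames of 4x4 pixels standing in for a real clip.
TARGET = rng.standard_normal((2, 4, 4))

def toy_denoiser(x_noisy, sigma):
    # Hypothetical stand-in for a pre-trained video generator used as a
    # denoising autoencoder: here it simply pulls the noisy sample
    # partway toward a fixed clean target.
    return x_noisy + 0.5 * (TARGET - x_noisy)

def self_refine(x, sigma=0.3, n_iters=4, k=3, tau=0.05):
    """One self-refining sampling pass (sketch, assumed hyperparameters).

    At each iteration, re-noise the current sample k times and denoise
    each copy. The per-element variance across the k estimates serves as
    a self-consistency signal: only elements whose estimates agree
    (variance below tau) are refined, to avoid over-refinement artifacts.
    """
    for _ in range(n_iters):
        estimates = []
        for _ in range(k):
            x_noisy = x + sigma * rng.standard_normal(x.shape)
            estimates.append(toy_denoiser(x_noisy, sigma))
        est = np.stack(estimates)
        mean, var = est.mean(axis=0), est.var(axis=0)
        mask = var < tau              # self-consistent regions only
        x = np.where(mask, mean, x)   # refine those, leave the rest
    return x

# Start from a corrupted sample and refine it in place.
x_init = TARGET + 0.8 * rng.standard_normal(TARGET.shape)
x_out = self_refine(x_init)
```

In this toy setting the refined sample ends up closer to the clean target than the initial one, while elements with inconsistent estimates are left untouched, mirroring how selective refinement is meant to prevent over-refinement.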