Do video diffusion models encode signals predictive of physical plausibility? We probe the intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting that frozen DiT features carry recoverable physics-related cues. Building on this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints with a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Experiments on PhyGenBench show that our method improves physical consistency while reducing inference cost, achieving results comparable to Best-of-K sampling with substantially fewer denoising steps.
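To make the pruning idea concrete, the following is a minimal sketch of progressive trajectory selection under stated assumptions; it is not the paper's implementation. The names `progressive_select`, `denoise_step`, `score_fn`, `checkpoints`, and `keep_fraction` are hypothetical, the halving schedule is an illustrative choice, and the toy denoiser and scorer stand in for a real DiT and a physics verifier. In particular, the verifier described above scores frozen mid-layer features, whereas this sketch scores the latent directly for simplicity.

```python
import numpy as np


def progressive_select(init_latents, denoise_step, score_fn,
                       num_steps, checkpoints, keep_fraction=0.5):
    """Denoise several candidates in parallel and prune low scorers early.

    All arguments are hypothetical stand-ins:
      denoise_step(x, t) -> next latent (a real system would call the DiT)
      score_fn(x, t)     -> plausibility score (higher is better)
    Returns the surviving latent and the number of denoising-step evaluations.
    """
    candidates = list(init_latents)
    step_evals = 0
    for t in range(num_steps):
        # Advance every surviving trajectory by one denoising step.
        candidates = [denoise_step(x, t) for x in candidates]
        step_evals += len(candidates)
        # At a checkpoint, keep only the top-scoring fraction of candidates.
        if t in checkpoints and len(candidates) > 1:
            scores = np.array([score_fn(x, t) for x in candidates])
            n_keep = max(1, int(np.ceil(len(candidates) * keep_fraction)))
            keep = np.argsort(scores)[::-1][:n_keep]
            candidates = [candidates[i] for i in keep]
    # Pick the best remaining candidate after the final step.
    final_scores = [score_fn(x, num_steps - 1) for x in candidates]
    best = candidates[int(np.argmax(final_scores))]
    return best, step_evals


# Toy stand-ins (assumptions, not a real model or verifier).
def toy_denoise(x, t):
    return 0.9 * x


def toy_score(x, t):
    return -float(np.linalg.norm(x))


K, NUM_STEPS = 8, 50
init = [np.full(4, float(k)) for k in range(1, K + 1)]
best, step_evals = progressive_select(
    init, toy_denoise, toy_score, NUM_STEPS, checkpoints={10, 20, 30})
best_of_k_evals = K * NUM_STEPS  # Best-of-K denoises all K candidates fully
print(step_evals, best_of_k_evals)
```

In this toy configuration the candidate pool shrinks from 8 to 4 to 2 to 1 at the three checkpoints, so the number of denoising-step evaluations falls well below the `K * NUM_STEPS` that Best-of-K would spend. How much is saved in practice depends on where the checkpoints sit and how aggressively candidates are pruned.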