The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating the distributional quality of generated video. However, its effectiveness relies on critical assumptions. Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D Convnet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; (3) the impractical sample sizes required for reliable estimation. These findings undermine FVD's reliability and show that it falls short as a standalone metric for video generation evaluation. After extensive analysis of a wide range of metrics and backbone architectures, we propose JEDi, the JEPA Embedding Distance, which is based on features derived from a Joint Embedding Predictive Architecture and measured using Maximum Mean Discrepancy with a polynomial kernel. Our experiments on multiple open-source datasets show clear evidence that JEDi is a superior alternative to the widely used FVD metric: it requires only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34% on average.
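To make the distance computation concrete, the following is a minimal sketch of an unbiased squared Maximum Mean Discrepancy estimate with a polynomial kernel, applied to two sets of feature vectors. It is an illustration under stated assumptions, not the reference JEDi implementation: the kernel hyperparameters (`degree=2`, `gamma=1/d`, `coef0=1`) are assumed values that the abstract does not specify, and the random arrays merely stand in for JEPA embeddings of real and generated videos.

```python
import numpy as np


def polynomial_kernel(x, y, degree=2, gamma=None, coef0=1.0):
    """Polynomial kernel k(x, y) = (gamma * <x, y> + coef0) ** degree.

    The default hyperparameters are assumptions made for this sketch.
    """
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # assumed default: 1 / feature dimension
    return (gamma * (x @ y.T) + coef0) ** degree


def mmd2_unbiased(x, y, **kernel_args):
    """Unbiased estimate of squared MMD between samples x (m, d) and y (n, d)."""
    m, n = x.shape[0], y.shape[0]
    k_xx = polynomial_kernel(x, x, **kernel_args)
    k_yy = polynomial_kernel(y, y, **kernel_args)
    k_xy = polynomial_kernel(x, y, **kernel_args)
    # Drop the diagonal terms k(x_i, x_i) so the within-set averages are unbiased.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()


# Placeholder features standing in for video embeddings (hypothetical data).
rng = np.random.default_rng(0)
real = rng.normal(size=(256, 64))
same = rng.normal(size=(256, 64))                # same distribution as `real`
shifted = rng.normal(loc=0.5, size=(256, 64))    # mean-shifted distribution

mmd_same = mmd2_unbiased(real, same)
mmd_shifted = mmd2_unbiased(real, shifted)
print(f"same distribution:    {mmd_same:.4f}")
print(f"shifted distribution: {mmd_shifted:.4f}")
```

Unlike the Fréchet distance, this estimator does not fit a Gaussian to the features, so it does not depend on the Gaussianity assumption that the abstract identifies as a weakness of FVD. In the sketch, the estimate stays near zero when both sets are drawn from the same distribution and grows when one set is shifted.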