Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, this growth cannot be attributed solely to models and benchmarks: widely accepted evaluation metrics also play an important role in advancing the field. While many metrics exist for evaluating audio and visual content separately, there is a lack of metrics that offer a quantitative and interpretable measure of audio-visual synchronization for videos "in the wild". To address this gap, we first created a large-scale, human-annotated dataset (100+ hours) representing nine types of synchronization errors in audio-visual content and how humans perceive them. We then developed the PEAVS (Perceptual Evaluation of Audio-Visual Synchrony) score, a novel automatic metric that rates the quality of audio-visual synchronization on a 5-point scale. We validate PEAVS on a newly generated dataset, achieving a Pearson correlation with human labels of 0.79 at the set level and 0.54 at the clip level. In our experiments, we observe a relative gain of 50% over a natural extension of Fréchet-based metrics to audio-visual synchrony, confirming the efficacy of PEAVS in objectively modeling subjective perceptions of audio-visual synchronization for videos "in the wild".
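To make the distinction between clip-level and set-level correlation concrete, the following is a minimal sketch of one way such numbers could be computed. It is not the authors' evaluation code: the grouping of clips into sets by a `set_id`, the use of per-set mean scores, the function names, and the toy scores are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def clip_and_set_correlation(records):
    """Return (clip_level_r, set_level_r).

    records: iterable of (set_id, metric_score, human_score), one per clip.
    Assumption: a "set" is a group of clips sharing a set_id, and the
    set-level score is the mean over the clips in that group.
    """
    records = list(records)

    # Clip level: correlate the raw per-clip scores directly.
    clip_r = pearson([m for _, m, _ in records], [h for _, _, h in records])

    # Set level: average within each set first, then correlate the means.
    grouped = defaultdict(list)
    for set_id, metric, human in records:
        grouped[set_id].append((metric, human))
    set_metric = [sum(m for m, _ in v) / len(v) for v in grouped.values()]
    set_human = [sum(h for _, h in v) / len(v) for v in grouped.values()]
    set_r = pearson(set_metric, set_human)

    return clip_r, set_r


# Hypothetical toy data: (set_id, automatic score, human score on a 1-5 scale).
toy_records = [
    ("shift_small", 4.2, 5), ("shift_small", 4.6, 4), ("shift_small", 3.9, 4),
    ("shift_medium", 3.1, 2), ("shift_medium", 3.6, 4), ("shift_medium", 2.8, 3),
    ("shift_large", 1.9, 2), ("shift_large", 2.4, 1), ("shift_large", 1.5, 1),
]
clip_r, set_r = clip_and_set_correlation(toy_records)
print(f"clip-level r = {clip_r:.2f}, set-level r = {set_r:.2f}")
```

Averaging within a set cancels much of the per-clip rating noise, which is one plausible reason a set-level correlation can be noticeably higher than the clip-level one, as in the 0.79 versus 0.54 reported above.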