This paper presents a benchmark dataset for aligning lecture videos with their corresponding slides and introduces a novel multimodal algorithm that leverages features from speech, text, and images. The algorithm achieves an average accuracy of 0.82, compared with 0.56 for a SIFT baseline, while running approximately 11 times faster. It uses dynamic programming to determine the optimal slide sequence, and the results show that penalizing slide transitions increases accuracy. Features obtained via optical character recognition (OCR) contribute the most to matching accuracy, followed by image features. The findings also show that audio transcripts alone provide valuable information for alignment and are especially beneficial when OCR data is lacking. Variations in matching accuracy across lectures highlight challenges posed by video quality and lecture style; the multimodal algorithm demonstrates robustness to some of these challenges, underscoring the potential of the approach.
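The abstract does not spell out the dynamic-programming formulation, so the following is a minimal sketch of how a transition-penalized alignment could work, assuming a precomputed segment-by-slide similarity matrix `sim` (e.g., fused OCR, image, and transcript scores) and a constant `switch_penalty`; both names and the exact scoring scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def align_slides(sim: np.ndarray, switch_penalty: float = 0.1) -> list:
    """Assign one slide to each video segment via dynamic programming.

    sim[t, s] is an assumed multimodal similarity between video segment t
    and slide s; switch_penalty discourages changing slides between
    consecutive segments. Returns the slide index for each segment.
    """
    n_segments, n_slides = sim.shape
    # score[t, s]: best cumulative score ending with slide s at segment t
    score = np.full((n_segments, n_slides), -np.inf)
    back = np.zeros((n_segments, n_slides), dtype=int)
    score[0] = sim[0]
    for t in range(1, n_segments):
        for s in range(n_slides):
            # Transitioning from any other slide incurs the penalty;
            # staying on the same slide does not.
            prev = score[t - 1] - switch_penalty
            prev[s] = score[t - 1, s]
            best_prev = int(np.argmax(prev))
            back[t, s] = best_prev
            score[t, s] = prev[best_prev] + sim[t, s]
    # Trace back the optimal slide sequence from the best final state
    path = [int(np.argmax(score[-1]))]
    for t in range(n_segments - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In this sketch, raising `switch_penalty` suppresses spurious slide transitions, which mirrors the abstract's finding that penalizing transitions improves alignment accuracy.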