In this paper, we consider the problem of temporally aligning video and text in instructional videos. Specifically, given a long video and its associated text sentences, our goal is to determine the timestamps in the video to which each sentence corresponds. To this end, we establish a simple yet strong model that adopts a Transformer-based architecture with all texts as queries, iteratively attending to the visual features to infer the optimal timestamps. We conduct thorough experiments to investigate: (i) the effect of upgrading the automatic speech recognition (ASR) system to reduce recognition errors; (ii) the effect of various visual-textual backbones, ranging from CLIP and S3D to the more recent InternVideo; and (iii) the effect of transforming noisy ASR transcripts into descriptive steps by prompting a large language model (LLM) to summarize the core activities within the ASR transcript, yielding a new training dataset. As a result, our proposed simple model demonstrates superior performance on both narration alignment and procedural step grounding tasks, surpassing existing state-of-the-art methods by a significant margin on three public benchmarks, namely by 9.3% on HT-Step, 3.4% on HTM-Align, and 4.7% on CrossTask. We believe the proposed model and the dataset with descriptive steps can serve as a strong baseline for future research in temporal video-text alignment. All code, models, and the resulting dataset will be publicly released to the research community.
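To make the text-as-query idea concrete, the following is a minimal, untrained sketch and not the model proposed in this paper. It assumes that text and frame features already live in a shared embedding space, replaces the learned Transformer layers with plain scaled dot-product cross-attention plus a residual update, and reads out the timestamp as the most similar frame. The names `cross_attend`, `align`, `num_layers`, and `seconds_per_frame` are hypothetical and chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, frames):
    """One cross-attention step: each text query attends over all frames.

    Assumption: no learned projections; frames act as both keys and values.
    """
    dim = len(frames[0])
    updated = []
    for q in queries:
        weights = softmax([dot(q, f) / math.sqrt(dim) for f in frames])
        context = [sum(w * f[k] for w, f in zip(weights, frames))
                   for k in range(dim)]
        # Residual update: the query is refined by the attended visual context.
        updated.append([qk + ck for qk, ck in zip(q, context)])
    return updated


def align(text_feats, frame_feats, num_layers=2, seconds_per_frame=1.0):
    """Iteratively refine text queries, then pick the best-matching frame.

    Returns one timestamp (in seconds) per text sentence.
    """
    queries = text_feats
    for _ in range(num_layers):
        queries = cross_attend(queries, frame_feats)
    timestamps = []
    for q in queries:
        sims = [dot(q, f) for f in frame_feats]
        best = max(range(len(sims)), key=sims.__getitem__)
        timestamps.append(best * seconds_per_frame)
    return timestamps


if __name__ == "__main__":
    # Toy features: three frames and two sentences in a 3-d shared space.
    frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    texts = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    print(align(texts, frames))
```

On the toy input, the first sentence is assigned to the third frame and the second sentence to the first frame. A trained model would instead use learned projections, positional encodings, and a learned timestamp head, none of which are specified in this abstract.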