Modern AI applications involving video, such as video-text alignment, video search, and video captioning, benefit from a fine-grained understanding of video semantics. Existing approaches for video understanding are either data-hungry and need low-level annotation, or are based on general embeddings that are uninterpretable and can miss important details. We propose LASER, a neuro-symbolic approach that learns semantic video representations by leveraging logic specifications that can capture rich spatial and temporal properties in video data. In particular, we formulate the problem in terms of alignment between raw videos and specifications. The alignment process efficiently trains low-level perception models to extract a fine-grained video representation that conforms to the desired high-level specification. Our pipeline can be trained end-to-end and can incorporate contrastive and semantic loss functions derived from specifications. We evaluate our method on two datasets with rich spatial and temporal specifications: 20BN-Something-Something and MUGEN. We demonstrate that our method not only learns fine-grained video semantics but also outperforms existing baselines on downstream tasks such as video retrieval.
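To make the idea of a contrastive loss over video-specification alignment concrete, the following is a minimal sketch and not the loss used by LASER itself. It assumes a hypothetical square matrix `align_probs`, where `align_probs[i][j]` is the probability that video `i` satisfies specification `j`, with matched pairs on the diagonal. The function name, the binary cross-entropy form, and the `eps` clamp are all illustrative assumptions.

```python
import math


def contrastive_alignment_loss(align_probs, eps=1e-7):
    """Binary cross-entropy over a square matrix of alignment probabilities.

    align_probs[i][j] is a hypothetical probability that video i satisfies
    specification j; the diagonal holds the matched (positive) pairs.
    """
    n = len(align_probs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Clamp to avoid log(0).
            p = min(max(align_probs[i][j], eps), 1.0 - eps)
            # Matched pairs should score high, mismatched pairs low.
            total += -math.log(p) if i == j else -math.log(1.0 - p)
    return total / (n * n)


# Toy scores: "good" aligns each video with its own specification,
# "bad" aligns each video with the other one.
good = [[0.9, 0.1], [0.2, 0.8]]
bad = [[0.1, 0.9], [0.8, 0.2]]
print(contrastive_alignment_loss(good) < contrastive_alignment_loss(bad))
```

In a trainable pipeline the probabilities would come from a differentiable reasoning step over the perception model's outputs, so that minimizing a loss of this kind pushes the extracted representation toward satisfying its own specification and away from the others.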