We propose LASER, a neuro-symbolic approach that learns semantic video representations capturing rich spatial and temporal properties of video data by leveraging high-level logic specifications. In particular, we formulate the problem as one of alignment between raw videos and spatio-temporal logic specifications. The alignment algorithm leverages a differentiable symbolic reasoner and a combination of contrastive, temporal, and semantic losses. It effectively and efficiently trains low-level perception models to extract a fine-grained video representation, in the form of a spatio-temporal scene graph, that conforms to the desired high-level specification. In doing so, we explore a novel methodology that weakly supervises the learning of video semantic representations through logic specifications. We evaluate our method on two datasets with rich spatial and temporal specifications: 20BN-Something-Something and MUGEN. We demonstrate that our method learns better fine-grained video semantics than existing baselines.
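To make the idea of aligning videos with spatio-temporal specifications more concrete, the following is a minimal sketch and not LASER's actual implementation. The abstract does not specify the reasoner or the exact form of the losses, so everything here is an assumption: the function names `eventually`, `then`, and `contrastive_loss` are hypothetical, per-frame probabilities are assumed to come from some perception model, frames are treated as independent, and the contrastive term is taken to be a plain binary cross-entropy over a video-by-specification alignment matrix. Pure Python floats are used for readability; a real system would presumably express the same arithmetic in an autodiff framework so that gradients reach the perception model.

```python
import math


def eventually(frame_probs):
    """Soft 'eventually': probability that a fact holds in at least one frame.

    Assumption: frames are independent, so this is a noisy-or.
    """
    p_never = 1.0
    for p in frame_probs:
        p_never *= 1.0 - p
    return 1.0 - p_never


def then(first_probs, second_probs):
    """Soft 'first holds, and second holds in a strictly later frame'.

    Assumption: score the best pair of frames i < j by the product of
    their probabilities (a max-product relaxation).
    """
    best = 0.0
    for i, p in enumerate(first_probs):
        for q in second_probs[i + 1:]:
            best = max(best, p * q)
    return best


def contrastive_loss(align, eps=1e-7):
    """Binary cross-entropy over a square alignment matrix.

    align[i][j] is the (hypothetical) probability that video i satisfies
    specification j. Matched pairs sit on the diagonal and are pushed
    toward 1; mismatched pairs are pushed toward 0.
    """
    n = len(align)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = min(max(align[i][j], eps), 1.0 - eps)  # clamp for log
            target = 1.0 if i == j else 0.0
            total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total / (n * n)
```

Under these assumptions, a specification such as "the object is closed and later open" would be scored with `then(closed_probs, open_probs)`, and the resulting scores for each video and specification pair would fill the matrix passed to `contrastive_loss`. A well-aligned matrix (high diagonal, low off-diagonal) yields a lower loss than an uninformative one.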