We propose LASER, a neuro-symbolic approach that learns semantic video representations capturing rich spatial and temporal properties of video data by leveraging high-level logic specifications. In particular, we formulate the problem as one of alignment between raw videos and spatio-temporal logic specifications. The alignment algorithm leverages a differentiable symbolic reasoner and a combination of contrastive, temporal, and semantic losses. It effectively and efficiently trains low-level perception models to extract a fine-grained video representation, in the form of a spatio-temporal scene graph, that conforms to the desired high-level specification. In doing so, we explore a novel methodology that weakly supervises the learning of video semantic representations through logic specifications. We evaluate our method on two datasets with rich spatial and temporal specifications: 20BN-Something-Something and MUGEN. We demonstrate that our method learns better fine-grained video semantics than existing baselines.
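To make the contrastive part of the alignment objective concrete, the following is a minimal sketch and not the actual LASER implementation. It assumes that a differentiable reasoner has already produced a matrix `probs`, where `probs[i][j]` is the probability that video `i` satisfies specification `j`, and that video `i` is paired with specification `i` in the batch. The function name `contrastive_alignment_loss`, the matrix layout, and the choice of binary cross-entropy are all illustrative assumptions; the temporal and semantic losses are not shown.

```python
import math


def contrastive_alignment_loss(probs, eps=1e-7):
    """Binary cross-entropy over a batch alignment matrix (illustrative only).

    probs[i][j]: assumed probability that video i satisfies specification j.
    Video i is assumed to be paired with specification i, so diagonal
    entries are pushed toward 1 and off-diagonal entries toward 0.
    """
    n = len(probs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Clamp to keep the logarithms finite.
            p = min(max(probs[i][j], eps), 1.0 - eps)
            target = 1.0 if i == j else 0.0
            total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total / (n * n)


# Hypothetical alignment matrices for a batch of two video/specification pairs.
aligned = [[0.9, 0.1], [0.2, 0.8]]
misaligned = [[0.1, 0.9], [0.8, 0.2]]

print(contrastive_alignment_loss(aligned) < contrastive_alignment_loss(misaligned))
```

Under these assumptions, a batch whose videos match their own specifications yields a lower loss than one whose videos match the wrong specifications, which is the signal that weakly supervises the perception model.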