We propose LASER, a neuro-symbolic approach to learning semantic video representations that capture rich spatial and temporal properties in video data by leveraging high-level logic specifications. In particular, we formulate the problem as alignment between raw videos and spatio-temporal logic specifications. The alignment algorithm leverages a differentiable symbolic reasoner and a combination of contrastive, temporal, and semantic losses. It effectively and efficiently trains low-level perception models to extract a fine-grained video representation, in the form of a spatio-temporal scene graph, that conforms to the desired high-level specification. To reduce the manual effort of obtaining ground-truth labels, we derive logic specifications from captions by employing a large language model with a generic prompting template. In doing so, we explore a novel methodology that weakly supervises the learning of spatio-temporal scene graphs with widely accessible video-caption data. We evaluate our method on three datasets with rich spatial and temporal specifications: 20BN-Something-Something, MUGEN, and OpenPVSG. We demonstrate that our method learns better fine-grained video semantics than existing baselines.
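To make the training objective concrete, the following is a minimal sketch of how the three losses named above might be combined. It assumes the symbolic reasoner produces a differentiable score matrix of video-specification alignment probabilities with matched pairs on the diagonal; the function names, the InfoNCE-style formulation of the contrastive term, and the loss weights are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(scores: torch.Tensor) -> torch.Tensor:
    """InfoNCE-style contrastive loss over a batch of alignment scores.

    Assumption: scores[i, j] is the differentiable probability (from the
    symbolic reasoner) that video i satisfies specification j, so matched
    video/specification pairs lie on the diagonal.
    """
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)

def total_loss(scores: torch.Tensor,
               temporal_loss: torch.Tensor,
               semantic_loss: torch.Tensor,
               w_t: float = 1.0,
               w_s: float = 1.0) -> torch.Tensor:
    """Hypothetical weighted sum of the contrastive, temporal, and semantic
    losses; the weights w_t and w_s are placeholders for tuned values."""
    return contrastive_alignment_loss(scores) + w_t * temporal_loss + w_s * semantic_loss
```

Under this reading, the contrastive term pulls each video toward its own specification and away from the other specifications in the batch, while the temporal and semantic terms (whose exact forms the abstract does not specify) regularize the scene-graph predictions.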